mission, based on which tasks the crewmen could and could not perform. Reliability
of the ratings averaged .68. Ways of improving the quality of task criticality
studies were discussed.

Cluster analysis was used to group tasks by crew position according to
similarities among descriptors by which the tasks were characterized. Eighty
task clusters or "skills" were identified, 21 for the Driver, 19 for the
Loader, 20 for the Gunner, and 20 for the Tank Commander.


Criticality, learning difficulty, and evaluation difficulty were estimated
for each task cluster.


Results of the research indicated that: (1) the task analyses and the
task criticality studies yielded results that will be useful for assigning
training priorities; (2) the cluster analyses produced groups of tasks which
appear reasonable, though the implications for training design remain to be
demonstrated; and (3) results of the learning and evaluation difficulty
studies were inconclusive.

BRIEF


This  report  describes  the  conduct  and  results  of  the  first  task  of 
a two-task  project  to  design  training  for  Armor  and  Cavalry  National 
Guard  units. 


REQUIREMENT 

The requirement to which Task 1 was addressed was to analyze
tasks, estimate criticality, and perform related work in preparation
for designing training for Reserve Components^1 that use the M48A5 tank.
The objectives to be achieved during this preparatory work were to:

1. Generate and organize task data for the
M48A5, M60A1, M60A3, and XM-1 tanks.

2. Identify tasks that are common and unique
to the M48A5, M60A1, and M60A3.

3. Use a paired-comparison technique to
estimate the relative criticality of tasks
for each of the three tanks.

4. Establish the reliability of the task
criticality estimates.

5. Prepare plans for investigating the
validity of the criticality estimates.

6. Use cluster analysis to group tasks into
"skills," according to descriptors that
have implications for training design.

7. Estimate the criticality, and the difficulty
of learning and evaluating, each of
the task groups or "skills" identified as
the result of item 6, above.


PROCEDURE  AND  RESULTS 

Achievement of the objectives listed above is described in four parts:

1. Generating and Organizing Task Data.

2. Task Criticality.

3. Cluster Analysis.

4. Skill Criticality, Learning Difficulty,
and Evaluation Difficulty.


^1 "Reserve Components," as used in this report, refers to National Guard and
U.S. Army Reserve units. With few exceptions, the only Reserve Components
that are using or scheduled to use the M48A5 tank are Armor and Cavalry
National Guard units.


Generating  and  Organizing  Task  Data 


The project began with generating and organizing task data for the
tank systems. Data sources included task data cards from the U.S. Army
Armor School, research reports, operators' and equipment manuals, and
task lists generated by the project staff. The task data were presented
separately for each duty position in a form that shows which tasks are
common and unique to the M48A5, M60A1, and M60A3.^2

Task  Criticality 

Task criticality was estimated using a paired comparison study. Forty-eight
AOAC (Armor Officers' Advanced Course) students selected hypothetical
crewmen for a combat mission, based on which tasks the crewmen could
and could not perform. The assumption here was that the officers'
perceptions of task criticality would be reflected in their choices of
crewmen to take into combat. The study yielded numerical indexes of
criticality for each task.

The tasks receiving the highest criticality ratings were those that
would be expected by one familiar with tank operations: the Tank
Commander acquiring targets, the Tank Commander and Gunner firing the
main gun, the Loader loading, and the Driver driving tactically.


The reliability of the paired comparison judgments was estimated by
correlating the scale values of tasks common to the three tanks. Correlations,
computed by duty position for each pair of tanks, ranged from .55
to .79, with an average of .68. All were statistically significant (p < .05).
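The reliability estimate described above can be sketched in a few lines. This is a minimal illustration only: the task codes and scale values are hypothetical, and the helper name pearson_r is an assumption, since the report says no more than that scale values of tasks common to a pair of tanks were correlated.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical criticality scale values for one duty position on two tanks.
m60a1 = {"T1": 5.4, "T2": 4.1, "T3": 3.0, "T4": 2.2}
m48a5 = {"T1": 5.0, "T2": 4.4, "T3": 2.5, "T5": 1.0}

# Only tasks that appear on both tanks enter the correlation.
common = sorted(m60a1.keys() & m48a5.keys())
r = pearson_r([m60a1[t] for t in common], [m48a5[t] for t in common])
```

Repeating the calculation for each duty position and each pair of tanks would give a set of coefficients comparable in form to the .55 to .79 range reported.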

Suggestions were offered as to how inter-rater reliability might be
increased in future studies of task criticality with the paired comparison
technique:

1. Increase the precision of defining the parameters
on which judgments are to be made.

2. Provide opportunity for rater practice.

3. Use complete, as opposed to partial,
pairing designs.

4. Increase the number of observations per
paired comparison.


^2 Data for the XM-1 were submitted under separate cover. They were
not used in later analyses because they were preliminary and subject
to change.

A plan was presented for examining the construct validity of the
criticality estimates. Issues associated with the content and predictive
validity of criticality measurement also were discussed.

Cluster Analysis

Cluster analysis was used to group tasks according to similarities
among descriptors by which the tasks were characterized. The exercise
began with a search for a set of descriptors which could be used to
characterize all armor tasks, and which might have implications for
training design. Thirty-six descriptors were selected and used. Eleven
of the 36 describe stimuli that initiate and maintain task performance;
written materials and oral commands are examples. Six of the descriptors
pertain to the tools, instruments, and controls that are used in
task performance; variable setting controls, for example, and common
hand tools. Eleven descriptors pertain to the mediating processes
involved in task performance; using rules, for example, and recalling
set procedures. The remaining eight descriptors describe overt
responses; finger manipulation, for example, and reporting in writing.

The 36 descriptors were arrayed across the tops of data recording
forms, with tasks and subtasks listed down the left margin. Two members
of the project staff independently filled in the data tables,
entering a "1" in the columns corresponding to descriptors that characterized
each subtask, and leaving blank the descriptor columns that
did not pertain to the subtask. The two sets of one-zero data thus
generated served as the inputs for the inter-rater reliability studies
that followed.

Inter-rater reliability was examined by computing phi coefficients
for each of the four descriptor subsets (Stimuli; Tools, Instruments,
and Controls; Mediating Processes; and Overt Responses), and across
subsets, both before and after rater practice. Doing so permitted
examining not only inter-rater reliability, but also the effects of
practice on inter-rater reliability.

Inter-rater reliability increased significantly with practice and
discussion, irrespective of whether the tasks rated after practice
were the same as or different from the tasks rated for practice. Overall
inter-rater reliabilities for the tasks rated after practice were
about .70.
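For readers unfamiliar with the phi coefficient, the following sketch shows how agreement between two raters' one-zero descriptor entries could be computed. The ratings are hypothetical and the function name is an assumption for illustration; the report does not give its computational details.

```python
import math

def phi_coefficient(rater_a, rater_b):
    """Phi coefficient for two equal-length lists of 0/1 ratings."""
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(rater_a, rater_b):
        if a and b:
            n11 += 1          # both raters entered a "1"
        elif a:
            n10 += 1          # only rater A entered a "1"
        elif b:
            n01 += 1          # only rater B entered a "1"
        else:
            n00 += 1          # both raters left the cell blank
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("phi is undefined when a rater uses only one category")
    return (n11 * n00 - n10 * n01) / denom

# Hypothetical entries by two raters for eight task-by-descriptor cells.
rater_a = [1, 1, 1, 0, 0, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 0, 1, 1]
phi = phi_coefficient(rater_a, rater_b)
```

Computing the coefficient separately for each descriptor subset, and again over all cells, would mirror the subset and across-subset comparisons described above.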

After  inter-rater  reliability  was  examined,  the  two  raters  discussed 
their  ratings,  and  produced  a single,  reconciled,  task  by  task-descriptor 
matrix,  which  was  the  input  for  the  cluster  analyses. 
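The idea of grouping tasks by the similarity of their descriptor rows can be illustrated with a small sketch. It is not the procedure used in the project: the code below substitutes a simple single-linkage grouping at a Jaccard-similarity threshold, and the task names, the six-column matrix, and the 0.75 threshold are all hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between two 0/1 descriptor rows."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

def cluster_tasks(matrix, threshold=0.75):
    """Group tasks whose rows are linked by similarity >= threshold."""
    names = list(matrix)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the representative of n's group.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if jaccard(matrix[a], matrix[b]) >= threshold:
            parent[find(a)] = find(b)   # merge the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical task-by-descriptor matrix (the project used 36 descriptors).
tasks = {
    "start engine":    [1, 0, 1, 0, 1, 0],
    "stop engine":     [1, 0, 1, 0, 1, 0],
    "load main gun":   [0, 1, 0, 1, 0, 1],
    "unload main gun": [0, 1, 0, 1, 1, 1],
}
clusters = cluster_tasks(tasks)
```

With these made-up rows, the two engine tasks fall into one group and the two loading tasks into another, which is the kind of outcome a cluster or "skill" represents.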


The results of four cluster analyses, one for each duty position
across the three tank systems, were presented. Eighty task clusters or
"skills" were identified, 21 for the Driver, 19 for the Loader, 20 for
the Gunner, and 20 for the Tank Commander. Examples of the skills for
each duty position are:

1. Driver (M60A1, M48A5, M60A3), Perform Tank
Operation Procedures: Performs fixed
procedure multi-limb manipulation of
various controls in response to oral commands.

2. Loader (M60A1, M48A5, M60A3), Perform
Tactical Loading: Performs fixed procedure
finger-hand-arm manipulation of various controls
in response to oral commands by recalling
information; reports by talking.

3. Gunner (M60A1, M48A5, M60A3), Perform Misfire
Procedures: Performs fixed procedure finger-hand-arm
manipulation of various controls in
voluntary response to non-verbal sounds and
body-feel while communicating orally.

4. Tank Commander (M60A1, M48A5, M60A3), Boresight
and Zero Weapons: Performs continuous
and fixed procedure finger-hand-arm manipulation
of various controls and sometimes common
hand tools in voluntary response to man-made
environmental features, instrument read-outs,
and sometimes touch by recalling facts and
classifying information; reports by talking.

The tasks comprising each of the 80 task clusters are listed by duty
position in Appendix B.


Skill Criticality, Learning Difficulty, and Evaluation Difficulty

Skill criticality, the mean of the criticality scores for the
tasks comprising each of the 80 task clusters, was judged not particularly
useful for training design.


Learning difficulty and evaluation difficulty for the domain of
tank crew behavior associated with each task descriptor were rated
by five members of the project staff. The estimates for each descriptor
were averaged across raters. Difficulty estimates for each skill
were then made by assigning the descriptor scores to the modal
descriptor pattern for each skill.
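One possible reading of this procedure is sketched below. Two assumptions are made that the text does not confirm: the "modal descriptor pattern" is taken to be the most frequent one-zero row among the tasks in a skill, and a skill's difficulty is taken to be the sum of the mean ratings of the descriptors present in that pattern. The descriptor names, ratings, and function names are hypothetical.

```python
from collections import Counter
from statistics import mean

def descriptor_means(ratings):
    """Average each descriptor's difficulty ratings across raters."""
    return {name: mean(values) for name, values in ratings.items()}

def modal_pattern(rows):
    """Most frequent 0/1 descriptor row among a skill's tasks (assumed meaning)."""
    return Counter(tuple(row) for row in rows).most_common(1)[0][0]

def skill_difficulty(rows, descriptors, means):
    """Sum the mean ratings of descriptors present in the modal pattern (assumed rule)."""
    pattern = modal_pattern(rows)
    return sum(means[d] for d, present in zip(descriptors, pattern) if present)

# Hypothetical learning-difficulty ratings from five raters.
descriptors = ["oral commands", "hand tools", "using rules"]
ratings = {
    "oral commands": [2, 3, 2, 3, 2],
    "hand tools":    [1, 1, 2, 1, 1],
    "using rules":   [4, 5, 4, 4, 3],
}
means = descriptor_means(ratings)

# Hypothetical descriptor rows for the three tasks making up one skill.
skill_rows = [[1, 0, 1], [1, 0, 1], [1, 1, 1]]
difficulty = skill_difficulty(skill_rows, descriptors, means)
```

Under a summing rule of this kind, a skill characterized by many descriptors would receive a high score regardless of how easy the tasks are in practice, so the choice of combination rule matters.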


The estimates of learning and evaluation difficulty were highly
reliable (.76 and .88) in terms of the stability of the mean ratings
obtained. The results were, however, judged inconclusive, because some
seemed at odds with reality. The Driver's cluster, "Start Tank Engine,"
for example, received an extremely high difficulty rating. The apparent
aberrations may have been the result of deficiencies in the methods
for computing difficulty, inappropriate naming of some clusters, or both.

Suggestions were made for examining the construct validity of learning
and evaluation difficulty using designs similar to the one presented
for criticality (Appendix F). Construct validity was tentatively
examined in light of correlations between learning and evaluation difficulty
(r = .76), and between each of the difficulty estimates and criticality
(r = .44 in both cases).

USE OF FINDINGS


The results reported here are intended to be used during Task 2
to design training for Reserve Components that use the M48A5 tank.

The task analyses and the task criticality studies yielded results
that will be useful for assigning training priorities. The cluster
analyses produced reasonable-appearing groups of tasks, though the
implications for training design remain to be demonstrated. The
results of the learning and evaluation difficulty studies were inconclusive,
and will not be used.


PREFACE 


This is the Final Report for Task 1 of a two-task project entitled
"Tank Systems Skills and Training Structure." The report describes
task-analytic and related work done in preparation for developing training
outlines for Reserve Components that use the M48A5 tank.

The work reported in this volume was performed at the Fort Knox
Office of the Human Resources Research Organization (HumRRO), under
Contract No. DAHC-19-76-C-0001 with the U.S. Army Research Institute
for the Behavioral and Social Sciences (ARI).

John A. Boldovici is directing the project, which is staffed by
Roy C. Campbell, J. Patrick Ford, James H. Harris, Charlotte L. Heinecke,
Richard E. O'Brien, and William C. Osborn.

Paul W. Fingerman, Andrew M. Rose, and George R. Wheaton of the
American Institutes for Research assisted substantially in interpreting
the results of the cluster analysis under a subcontract with HumRRO.

Donald F. Haggard, the Contracting Officer's Technical Representative,
provided administrative assistance, valuable criticism, and substantive
suggestions for conceptualizing problems and solutions throughout
the project.

The criticality study that was part of Task 1 could not have been
conducted without the cooperation of many people. MAJ Douglas W. Smith,
ARI Senior R&D Coordinator at Fort Knox, assisted in recruiting and
scheduling subjects. Carolyn Harris assisted in designing the study.
The officers who served as subjects were, as usual, gracious and cooperative.

CRITICALITY AND CLUSTER ANALYSES OF TASKS FOR THE M48A5, M60A1,
AND M60A3 TANKS


The training needs of Reserve Components are changing. The M48A1
tank, which is the second most prevalent in the National Guard inventory,
is being replaced by the M48A5. Personnel turbulence, always a
problem in Reserve Components, promises to become even greater with the
elimination of the draft, and as the result of expiration of the eight-year
commitments of Guardsmen who entered service during the Vietnam
build-up. In addition to problems associated with equipment and personnel
turbulence, the costs of ammunition, real estate, range and
hardware maintenance, targets, fuel, transportation, and replacement
equipment continue to increase.

One effect of the trends noted above is that existing training for
Armor and Cavalry Reserve Components is becoming increasingly inappropriate
and obsolete. As old equipment is replaced with new, the training for
operation and maintenance of the old equipment becomes inappropriate, and
the need for new training becomes more compelling. As experienced Guardsmen
are replaced with inexperienced personnel, training that focuses on
higher level skills becomes insufficient, and training on basic skills
becomes necessary. And as costs increase, training that depends on large
quantities of ammunition, on frequent service practice firing, and on travel
to and from training sites becomes less acceptable, and the need for training
that can be delivered at armories becomes more obvious.

In the course of designing nearly any instructional program, several
difficult problems must be solved. These include:

1. How to select tasks or objectives for
inclusion in training.

2. How to group tasks for optimal efficiency
of presentation in training.


A common method of selecting tasks for inclusion in training is to
do so on the basis of task criticality; that is, to address only those
tasks whose mastery is most critical to effective performance on the job.
Measuring task criticality is, however, fraught with problems. Raters
may not agree on which tasks are most critical (a reliability problem),
and the ratings may be influenced by considerations other than criticality
(a validity problem). If measuring criticality is unreliable, invalid,
or both, then decisions about training content based on criticality measurement
are bound to be in error.

Even if perfect reliability and validity were achieved in decisions
about training content, the problem of bridging the gap between a task
list and sets of tasks or objectives grouped for optimal presentation in
training would remain. The issue of grouping tasks for training has been
addressed indirectly in basic research on behavior classification and
types of learning.^1 It has been addressed more directly in applied work
on methods for training development,^2,3,4 usually as a prelude to selecting
media, materials, and methods. Sorting tasks for presentation in training
is necessarily a subjective matter, and little is known about the reliability
of the results obtained. Adoption of the methods for sorting tasks
has not been widespread, perhaps because users find implementation difficult.
To the extent that methods for sorting tasks could be routinized,
two benefits would seem to accrue: The methods might become easier to use,
and the reliability of the results obtained might increase.


^1 See, for example, Gagné, R.M. The Conditions of Learning. New York,
New York: Holt, Rinehart and Winston, 1965.

^2 Gropper, G.L., and Short, J.G. Handbook for Training Development.
Pittsburgh, Pennsylvania: American Institutes for Research, 1969.

^3 Schumacher, S.P., and Glasgow, A.Z. Handbook for Designers of
Instructional Systems. Wright-Patterson Air Force Base, Ohio:
Aerospace Medical Research Laboratories, 1973.

^4 US Army Transportation School. Interservice Procedures for Instructional
Systems Development. Fort Eustis, Virginia: Author, 1975.


RATIONALE 

Recognizing the dual need for new Reserve Component training and for
addressing the training development issues outlined above, the US Army
Research Institute for the Behavioral and Social Sciences (ARI) has undertaken
research to:

1. Design training plans for operating and
maintaining the M48A5 tank.

2. Explore new methods for establishing task
criticality, and for grouping tasks for
presentation in training.

This project is part of that research.

PURPOSE 

The ultimate purpose of the project is to design training for
Reserve and National Guard units that use M48A5 tanks. This report
describes the work performed during Task 1, whose purposes were to:

1. Generate and organize task data for the
M48A5, M60A1, M60A3, and XM-1 tanks.

2. Identify tasks that are common and
unique to the M48A5, M60A1, and M60A3.

3. Use a paired-comparison technique to
estimate the relative criticality of
tasks for each of the three tanks.

4. Establish the reliability of the task
criticality estimates.

5. Prepare plans for investigating the
validity of the criticality estimates.

6. Use cluster analysis^1,2 to group tasks
into "skills," according to descriptors
that have implications for training
design.

7. Estimate the criticality, and the difficulty
of learning and evaluating, each of
the task groups or "skills" identified as
the result of item 6, above.

^1 Hartigan, J.A. Direct clustering of a data matrix. Journal of the
American Statistical Association, 67, 1972.

^2 Dixon, W.J. (Ed.). BMDP: Biomedical Computer Programs. Berkeley,
California: University of California Press, 1975.

ORGANIZATION OF THE REPORT


How each of the objectives listed above was achieved is described in
four major sections of the report:

1. "Generating and Organizing Task Data" addresses
the first and second objectives listed above.

2. "Task Criticality" addresses the third, fourth,
and fifth objectives.

3. "Cluster Analysis" addresses the sixth objective.

4. "Skill Criticality, Learning Difficulty, and
Evaluation Difficulty" addresses the seventh
objective.


GENERATING AND ORGANIZING TASK DATA


The project began with generating and organizing task data. The
task lists would be used later in the project in a study of task criticality
and in exploring the utility of cluster analysis as a method of
grouping tasks for presentation in training.

Four tanks were addressed, in order to include systems used at present
and systems planned for use in the future:

1. The M60A1, which now predominates in the Active
Army and National Guard.

2. The M60A3, an improved (retrofitted) version
of the M60A1.

3. The M48A5, which is replacing the second most
prevalent tank in the National Guard (the M48A1)
and will thus become, with the M60A1, the "staple"
for Reserve Components.

4. The XM-1, which eventually will become the US Army's
main battle tank.

METHOD 

Task lists for both XM-1 prototypes were written, using preliminary
training outlines, equipment data, and manuals that were available at
the time. The task lists have been presented elsewhere,^1 but were not
used in later project work since the data were preliminary and subject to
change.

Assembling the task data for the other three tanks began with a
review of operations and maintenance tasks that had been rated critical
or important in earlier studies by the US Army and its contractors. This
preliminary task pool or data base was supplemented with tasks from a
recent report on tank gunnery testing,^2 from operators' manuals and
equipment data, and from additions based on local expertise. The sources
for the task data are presented in Table 1, with summaries of the main
differences between the M60A1 task list and the lists for the other two
tanks. Additional details about generating and organizing the task data
are presented in Appendix A.

^1 O'Brien, R.E., and Boldovici, J.A. Task Lists for Chrysler XM-1 Prototype
(Project Memorandum No. 3). Fort Knox, Kentucky: Human Resources Research
Organization (HumRRO), 1976.

^2 Boldovici, J.A., Wheaton, G.R., and Boycan, G.G. Selecting Items for a
Tank Gunnery Test. Fort Knox, Kentucky: Human Resources Research
Organization (HumRRO), 1976.

RESULTS 

Separate task lists for the M60A1, M48A5, and M60A3 were presented
under separate cover.^1 A combined list, showing tasks that are common and
unique to the three tanks, is presented in Appendix B. The cluster designations
and criticality scores in Appendix B can be ignored now; they
will be discussed later. Tasks in Appendix B that are common or unique
to the three tank systems can be identified by either or both of two
methods. The first two tasks in the Driver's list appear in Appendix B
as:

                                                    CRITICALITY
TASK NO.  TASK                                 M60A1   M48A5   M60A3

AD105     Install the M27 periscope            5.355           4.402

A5111     Install the M27 periscope (spare)            4.348

The first task (AD105) has entries in the criticality columns under M60A1
and M60A3, but not under M48A5. This indicates that the task is performed
by M60A1 and M60A3 Drivers, but not by M48A5 Drivers. The second task
(A5111) has an entry in the criticality column under M48A5, but not under
M60A1 or M60A3. This indicates that the task is performed by M48A5 Drivers,
but not by M60A1 or M60A3 Drivers.
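The reading rule just described (a blank criticality column means the task is not performed on that tank) can be expressed as a short sketch. The dictionary layout and function names are assumptions for illustration; the two rows are the Driver tasks quoted above, with the two AD105 values assigned to the M60A1 and M60A3 columns in left-to-right order.

```python
TANKS = ("M60A1", "M48A5", "M60A3")

# Criticality entries for the two Driver tasks; None marks a blank column.
appendix_b = {
    "AD105": {"M60A1": 5.355, "M48A5": None, "M60A3": 4.402},
    "A5111": {"M60A1": None, "M48A5": 4.348, "M60A3": None},
}

def performed_on(row):
    """Tanks whose criticality column holds an entry for this task."""
    return [tank for tank in TANKS if row[tank] is not None]

def classify(row):
    """Describe a task as common, unique, or shared by a subset of tanks."""
    tanks = performed_on(row)
    if len(tanks) == len(TANKS):
        return "common to all three"
    if len(tanks) == 1:
        return "unique to " + tanks[0]
    return "shared by " + " and ".join(tanks)

labels = {code: classify(row) for code, row in appendix_b.items()}
```

Applied to the full Appendix B listing, the same rule would separate the tasks common to all three tanks from those found on only one or two.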

A less direct method of identifying tasks that are unique or common
to the three tanks is by using the task code numbers (extreme left column
of Appendix B). The codes are explained in Appendix C.


^1 Harris, J.H. Task Lists for M60A1, M60A1(AOS), M48A5, and M60A3 Tanks
(Project Memorandum No. 1). Fort Knox, Kentucky: Human Resources Research
Organization (HumRRO), 1976.

DATA SOURCES FOR THE TASK LISTS, AND
SUMMARY OF DIFFERENCES BETWEEN THE M60A1

TASK CRITICALITY

Training resource limitations demand that choices be made about
what to include in training, and what to exclude. Agreement seems
widespread that training programs should minimally include tasks that
are critical to effective job performance (and cannot be performed by
new trainees). In military training contexts, this reduces to including
in training those tasks that are essential (critical) to effective
performance in combat. Since combat cannot be realistically simulated,
a measurement problem immediately arises; namely, how to measure
criticality.

Prescriptive training development literature such as the Interservice
Procedures for Instructional Systems Development^1 typically
mentions task criticality as an important consideration in determining
training content. The literature is, however, vague on the question of
how to measure criticality, and silent on the measurement issues associated
with criticality estimation.

Conventional training development methods deal with the problem of
selecting tasks for inclusion in training in the following way: A job
analysis is conducted, resulting in a task list or "inventory." Expert
judgment is then used to rate the criticality of each task on some n-point
scale ranging from "irrelevant to the job" to "highly critical to
mission accomplishment." The tasks receiving the highest ratings are
selected for inclusion in training, and those receiving low criticality
ratings are excluded or deemphasized. Since the content of training
frequently is determined on the basis of criticality ratings, a question
naturally arises as to how much confidence can be placed in the ratings.
One index of confidence is inter-rater reliability: to the extent that
several raters independently produce similar criticality ratings, confidence
in the job-relevance of training content based on the ratings
increases. The test-development axiom is directly analogous: reliability
is necessary for validity. Applied to training content, the axiom
becomes "reliability (of criticality ratings) is necessary for job-relevance
(of training content)."

^1 US Army Transportation School, op. cit., 1975.

The reliability of criticality ratings that are used for determining
training content seldom is reported. In the few instances where
reliability has been reported,^1,2,3 rater agreement has been poor, too low
in fact for the ratings to be of practical use. An exception appears
in a recent test-development project:^4 Two hundred forty tank gunnery
tasks were ranked in terms of criticality, which was determined by the
use of a paired-comparison technique. The Tank Commanders serving as
subjects were presented with many pairs of target/range combinations.
(An example of a pair of target/range combinations is tank at 2000
to 2500 meters, and light-armored vehicle at 500 to 1000 meters.) The
subjects were instructed to assume that they had encountered each pair
of target/range combinations on the battlefield, and that they could not
engage both targets simultaneously. They were then asked to indicate
which one of the two target/range combinations that comprised each item
they would engage first. A criticality score was computed by counting
the number of times each combination was chosen as more threatening
("would be engaged first") and dividing by the number of times it could
have been chosen.^5 Inter-rater reliability was in the high nineties.


^1 McCluskey, M.R., Jacobs, T.O., and Cleary, F.K. Systems Engineering
of Training for Eight Combat Arms MOSs. Alexandria, Virginia: Human
Resources Research Organization (HumRRO), 1975.

^2 McKnight, J.A., and Hundt, A.G. Driver Education Task Analysis: The
Development of Instructional Objectives. Alexandria, Virginia: Human
Resources Research Organization (HumRRO), 1972.

^3 Ammerman, H.L., and Pratzner, F.C. Occupational Survey on Auto Mechanics:
Task Data from Workers and Supervisors Indicating Job Relevance and
Training Criticalness. Columbus, Ohio: Ohio State University, 1975.

^4 Boldovici, J.A., Wheaton, G.R., and Boycan, G.G., op. cit., 1976.

^5 Guilford, J.P. Psychometric Methods. New York, New York: McGraw-Hill,
1954.

Since the rated items varied only in target type and range, the judgments
about target threat or criticality were easy to make. The high
degree of rater agreement probably also reflected certain learning
experiences that the subjects had in common: Tank Commanders receive
formal instruction in assessing target threat. The high inter-rater
reliability, therefore, may simply have indicated that all of the subjects
had learned "the same things." One wonders, then, whether similarly
high inter-rater reliability could be achieved using the paired-comparison
technique with a heterogeneous sample of tasks, where the dimensions for
making the criticality judgments were less obvious than target type and
range, and where the subjects had not received formal instruction in
making judgments of the kind required for the ratings. The present
study provided for answering the question.
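The proportion-chosen criticality score used in the earlier gunnery project (times chosen divided by times the item could have been chosen) can be sketched as follows. The judgments are hypothetical, the third target/range combination is invented for the example, and the function name is an assumption.

```python
from collections import Counter

def criticality_scores(judgments):
    """Proportion of presentations on which each item was chosen.

    judgments: iterable of (item_a, item_b, chosen) tuples, where chosen
    is whichever of the two items the respondent would engage first.
    """
    chosen = Counter()
    presented = Counter()
    for item_a, item_b, winner in judgments:
        if winner not in (item_a, item_b):
            raise ValueError("chosen item must be one of the pair")
        presented[item_a] += 1
        presented[item_b] += 1
        chosen[winner] += 1
    return {item: chosen[item] / presented[item] for item in presented}

# Hypothetical target/range combinations and respondent choices.
TANK = "tank at 2000 to 2500 meters"
LAV = "light-armored vehicle at 500 to 1000 meters"
TROOPS = "troops at 0 to 500 meters"   # invented for illustration

judgments = [
    (TANK, LAV, LAV),
    (TANK, LAV, LAV),
    (TANK, TROOPS, TANK),
    (LAV, TROOPS, LAV),
]
scores = criticality_scores(judgments)
```

Because each score is a proportion, items presented different numbers of times remain comparable, which matters when a partial pairing design is used.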


The purpose of the study was to use a paired comparison technique to
estimate the relative criticality of armor tasks rated critical and
important in earlier studies, and to establish the inter-rater reliability
of the estimates produced in the present study.

METHOD 

Respondents 

Forty-eight captains, who were enrolled in the Armor Officers'
Advanced Course (AOAC) at Fort Knox during the conduct of the study,
served as respondents.

Ques  tionnaires 

Twelve forms of a paired comparison questionnaire were used. The
units of comparison in each form were the tasks for one of four crew
positions (Driver, Loader, Gunner, or Tank Commander) in one of three
tanks (M60A1, M48A5, M60A3).
The design of each form of the questionnaire can be illustrated by
describing how the form for the M60A1 Driver tasks was designed. Seventy
M60A1 Driver tasks were identified during the task-description part of
the project. The number of possible different pairs of 70 tasks is
70 x 69/2 = 2415. This would have been too many judgments for each
respondent to make. A partial paired comparison design^ was therefore
used, in which each of the 70 tasks was paired with each of seven other
tasks. The partial pairing yielded 70 x 7/2 = 245 unique pairs of tasks
for the M60A1 Driver. The numbers of pairs of tasks for the other 11 forms
of the questionnaire are shown in Table 2. Details of how the task pairs
were formed are presented in Appendix D.
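
The pair counts for the M60A1 Driver form can be checked with a short sketch. The pairing procedure actually used is described in Appendix D, which is not reproduced in this section; the cyclic scheme below is only an assumed stand-in that yields a design with the same counts (70 tasks, seven pairings per task, 245 unique pairs). The function name and its arguments are hypothetical.

```python
from collections import Counter
from math import comb


def cyclic_partial_pairs(n_tasks, pairs_per_task):
    """Build a partial paired-comparison design (assumed cyclic scheme).

    Each task is paired with its neighbours at offsets 1, 2, ... around a
    circle; when pairs_per_task is odd, each task is also paired with the
    task directly opposite it (this requires an even number of tasks).
    """
    if not 0 < pairs_per_task < n_tasks:
        raise ValueError("pairs_per_task must be between 1 and n_tasks - 1")
    if pairs_per_task % 2 == 1 and n_tasks % 2 == 1:
        raise ValueError("an odd pairs_per_task needs an even n_tasks")
    pairs = set()
    for offset in range(1, pairs_per_task // 2 + 1):
        for i in range(n_tasks):
            j = (i + offset) % n_tasks
            pairs.add((min(i, j), max(i, j)))
    if pairs_per_task % 2 == 1:
        half = n_tasks // 2
        for i in range(half):
            pairs.add((i, i + half))
    return sorted(pairs)


full_design = comb(70, 2)                 # all possible pairs of 70 tasks
partial = cyclic_partial_pairs(70, 7)     # seven pairings per task
appearances = Counter(task for pair in partial for task in pair)
print(full_design, len(partial), set(appearances.values()))
```

A balanced design of this kind keeps every task in the same number of comparisons, which is what allows the number of times a task "could have been chosen" to be computed directly from the number of respondents and pairings.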

Procedure 

The Captains who volunteered for participation in the study were
instructed to be at a designated site at a particular time. Each of the
first 12 to arrive was given a different form of the questionnaire.
Each of the next 12 was given a different form, and so forth, until each
of the 12 forms had been given to four respondents.

The respondents were instructed to assume that they were company
commanders choosing crew members to take on a mission in which fire would
be exchanged with the enemy. They were then asked to indicate which of
two crew members they would choose, based on whether the crew member
could do one or the other of a pair of tasks. An example of a pair of
tasks for the M60A1 Loader is:

1.  Inspect  an  M219  machinegun. 

2. Stow main gun rounds in tank.

The respondents were informed that if they chose 1 in the example, they
would get a Loader who could inspect the machinegun but could not stow
main gun rounds. If they chose 2, they would get a Loader who could stow
rounds but could not inspect the M219.

^McCormick, E.J. and Bachus, J.A. Paired comparison ratings. 1. The
effect on ratings of reductions in the number of pairs. Journal of
Applied Psychology, April, 1952.
Each respondent's questionnaire dealt with only one crew position
and only one tank. The respondents completed their questionnaires at
home, and were encouraged to call a member of the project staff if
questions arose.

Additional details about the instructions to the respondents may
be found in Appendix E.


RESULTS 

Criticality values were calculated for each of the twelve sets of
tasks by a standard three-step procedure.^ First, the number of times
a task was chosen by the respondents was converted to a proportion by
dividing by the number of times it could have been chosen. The number
of times a task could have been chosen was the product of the number of
respondents (three or four)^ and the number of pairings for the task
(six or seven). The proportions were then changed to normal deviates, z.
Finally, the z values within each task set were transformed to standard
scores with a mean of 5.00 and standard deviation of 1.00. This final
transformation placed the 12 sets of values on a similar positive scale.
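
The three-step procedure can be sketched as follows. This is an illustration under stated assumptions, not the computation used in the study: the choice counts are invented, every task is given the same number of pairings, proportions of exactly 0 or 1 are clipped (an assumed safeguard, since they have no finite normal deviate), and the population standard deviation is used for the final rescaling.

```python
from statistics import NormalDist, mean, pstdev


def criticality_values(times_chosen, n_respondents, n_pairings, clip=0.01):
    """Convert choice counts to scale values with mean 5.00 and SD 1.00."""
    possible = n_respondents * n_pairings      # times a task could be chosen
    # Step 1: proportion of times each task was chosen (clipped: assumption).
    props = [min(max(c / possible, clip), 1 - clip) for c in times_chosen]
    # Step 2: proportions to normal deviates, z.
    z = [NormalDist().inv_cdf(p) for p in props]
    # Step 3: rescale z within the task set to mean 5.00, SD 1.00.
    z_mean, z_sd = mean(z), pstdev(z)
    return [5.0 + (value - z_mean) / z_sd for value in z]


# Hypothetical counts for four tasks, four respondents, seven pairings each.
values = criticality_values([25, 14, 3, 20], n_respondents=4, n_pairings=7)
print([round(v, 2) for v in values])
```

Because the rescaling is done within each task set, values are comparable in scale across the 12 forms but remain relative to the other tasks in the same set. The sketch would fail if every task were chosen equally often, since the standard deviation would then be zero.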

Criticality values of the tasks are shown by tank and duty position
in Appendix B. Tasks representative of the high and low ends of
the criticality scale are shown in Figure 1, where it can be seen that
the top-rated tasks are those that would be expected by one familiar
with tank operations: the Tank Commander acquiring targets, the Tank
Commander or Gunner firing the main gun, the Loader loading, and the
Driver driving tactically.

^Guilford, J.P., op. cit., 1954.

^Three Captains did not return their questionnaires.


CREW
POSITION         CRITICALITY   TASK

Tank Commander   High          . Acquire Ground Targets (night)
                               . TC Fires Main Gun Precision Using RFD
                                 (BEEHIVE)
                               . Zero Tank Main Gun

                 Low           . Boresight Searchlight Using Alternate
                                 Method (XENON)
                               . Troubleshoot M2 Machinegun
                               . Remove Periscope M36E1 Head Assembly

Gunner           High          . Fire Main Gun Precision Using TEL (Sta/Mov)
                               . Immediate Action In Case of Main Gun
                                 Failure to Fire
                               . Performs Main Gun Prepare-To-Fire
                                 Procedures

                 Low           . Position Gun Tube In Cradle In Response
                                 To Signals
                               . Place Turret Into Manual Operation
                               . TC Fires Nonprecision .50 Caliber Using
                                 TPI (Sta/Mov)

Loader           High          . Perform Emergency Closing of Main Gun
                                 Breech
                               . Load Tank Main Gun
                               . Perform Main Gun Prepare-To-Fire Procedures
                                 (Loader's Station)

                 Low           . Perform Before-Operations Checks On Air
                                 Cleaners
                               . Remove M37 Periscope
                               . Check Track Tension

Driver           High          . Perform Evasive Maneuvers On Enemy Contact
                               . Move Vehicle Into Defilade On Enemy Contact
                               . Perform Before-Operations Checks On Engine
                                 And Transmission

                 Low           . TC Fires Nonprecision Coax Using RFI
                                 (Sta/Mov)
                               . Place Turret Into Power Operation
                               . Perform After-Operations Checks On Fender
                                 And Stowage Boxes

Figure 1. Tasks representing the extremes in
criticality ratings.


Inter-rater reliability was estimated by correlating scale values
for tasks common to the three tanks. For example, 27 of the 113 Loader
tasks are performed by Loaders on both the M60A1 and the M60A3; the
two independently obtained sets of scale values for these 27 tasks were
correlated. Correlations, computed by crew position in this manner for
each pair of tanks, are shown in Table 3. They ranged from .55 to .79,
with an average of .68. All were statistically significant (p < .05).

Table 3

RELIABILITY OF CRITICALITY RATINGS
FOR TASKS COMMON TO PAIRS OF TANKS

                              Tank Pair (N)^1
Crew             M60A1       M60A1       M48A5
Position         M48A5       M60A3       M60A3       AVG^2

Commander        .69 (32)    .59 (16)    .79 (7)     .70
Gunner           .71 (35)    .72 (17)    .71 (12)    .72
Loader           .55 (61)    .65 (27)    .64 (25)    .62
Driver           .74 (41)    .64 (44)    .65 (27)    .68

^1 (N) = Number of tasks common to the pair of tanks.

^2 AVG = Means based on Fisher's z transformation, from Snedecor, G.W.
and Cochran, W.G. Statistical Methods (Sixth Edition).
Ames, Iowa: Iowa State University Press, 1967.
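
Averaging correlations through Fisher's z transformation, as in the AVG column of Table 3, can be sketched as below. The sketch assumes an unweighted mean of the transformed values; the report does not state whether the averages were weighted by the number of common tasks, so that choice is an assumption, as is the function name.

```python
import math


def mean_correlation(correlations):
    """Average correlations via Fisher's z (unweighted mean: an assumption)."""
    z_values = [math.atanh(r) for r in correlations]   # r to z
    z_mean = sum(z_values) / len(z_values)
    return math.tanh(z_mean)                           # mean z back to r


# Reliabilities from Table 3 for two crew positions.
commander = mean_correlation([0.69, 0.59, 0.79])
driver = mean_correlation([0.74, 0.64, 0.65])
print(round(commander, 2), round(driver, 2))
```

Transforming before averaging is used because the sampling distribution of r is skewed, particularly for larger correlations, whereas z is approximately normal.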




DISCUSSION 


The criticality ratings and inter-rater reliability raise
separate issues for discussion, as do questions about the validity
of the results obtained.

Criticality 

The tasks that were rated high in criticality make sense from
a rational or intuitive point of view. Tank Commanders acquiring
targets, Gunners firing the main gun, Loaders loading, and Drivers
driving tactically all seem essential for effective performance in
combat. But the low-rated tasks (Check Track Tension, for example,
and Place Turret in Manual Operation) present some interpretive
difficulty. The raters' judgments may have been influenced by the
likelihood that another crewman could perform the task if the designated
crewman could not, or that the task would not have to be performed
during a combat mission. Recall also that all the rated tasks
had been designated in earlier studies as critical or important.

Reliability 

The reliability of the criticality data, though statistically
significant and probably greater than the reliabilities of criticality
ratings in studies using absolute ratings,^ seems only marginally
acceptable in a practical sense: With a mean inter-rater
reliability of .68, the common variance is only about 46 percent
(.68 squared), or roughly half.
Considering the size of the training investments that are made to
teach tasks whose criticality is established by methods less rigorous
than the one used here, a search for ways to increase the reliability
of criticality ratings seems warranted. Comparing characteristics
of the present study with characteristics of other studies may be
instructive. No studies other than Boldovici et al.^ could be found


^See, for example, Harris, J.H., Campbell, R.C., Osborn, W.C., and
Boldovici, J.A. Development Of A Model Job Performance Test For A
Combat Occupational Specialty. Volume 1. Test Development. Fort
Knox, Kentucky: Human Resources Research Organization (HumRRO), 1975.

^Boldovici, J.A., Wheaton, G.R., and Boycan, G.G., op. cit., 1975.
in which reliabilities of criticality estimates higher than those
obtained here were reported. The earlier study differed from the
present one in several important respects.

The dimensions on which judgments were made were more obvious
in the earlier study than in the present one. Target type and target
range were the only dimensions along which items were varied in
the earlier study. In the present study, the dimensions along which
criticality judgments were to be made were less clear. Respondents
were simply asked to choose whom they would want to take into combat,
based on tasks that could or could not be performed by the chosen
crew member. The obvious difficulty here is that the nature of the
combat or the mission was not specified as clearly as it could have
been. Respondents were told only that the mission would involve
exchanging fire with the enemy. Given such a vague set, respondents
could and undoubtedly did "make up" missions, which differed from
one respondent to another. Depending on the anticipated mission, one
could, for example, just as easily justify choosing a Loader who
could stow main gun rounds as choosing a Loader who could inspect an
M219 machinegun. If the respondent doing the ratings was thinking of
a recon-by-fire mission or encountering soft targets hidden in a cane
field, his choice of a Loader would be different from the choice of a
respondent who was thinking of tank-to-tank combat.

The earlier study, in contrast to the present one, left little
room for subjects' "making up" the dimensions along which their
judgments of criticality would be made. Given a choice, for example,
between engaging a tank at 500 meters or a light-armored vehicle
at 2500 meters, the dimensions for making the choice are clear:

1. Which target is closer? and

2. Which target is more likely to be equipped with
the ammunition and other means for killing me?
The tank at 500 meters wins on both counts. More importantly, given
the absence of opportunity for engaging both targets simultaneously,
few if any tankers would disagree with the decision to engage the
tank at 500 meters before engaging the light-armored vehicle at
2500 meters. This leads to a second salient difference between the
present and the earlier study.

Subjects in the earlier study had certain learning experiences
in common, which contributed substantially to high agreement about
which one of two targets to engage first: As noted earlier, Tank
Commanders receive formal instruction in assessing target threat.
The high inter-rater reliability, therefore, may be viewed simply
as an index of the extent to which all Tank Commanders had learned
the "same things."

Another important difference is that the earlier study, while it
did not use complete pairings, more closely approximated a complete
pairing design than did the present study. To the extent that complete
pairings eliminate the "luck of the draw" in determining which
tasks get paired with one another, inter-rater reliability would be
expected to increase with increases in the number of possible pairs.
Some support for this hypothesis is suggested in the literature,
though the studies cited differed in many important respects from the
present one: in the number of raters, for example, in the total number
of stimulus items, in numbers of ratings per pair of items, and in
kinds of dependent variables.

^McCormick, E.J. and Bachus, J.A., op. cit., 1952.

^McCormick, E.J. and Roberts, W.K. Paired comparison ratings.
2. The reliability of ratings based on partial pairings. Journal
of Applied Psychology, 1952.

^Rambo, W.W. Paired comparison scale value variability as function
of partial pairing. Psychological Reports, 1959.

^Rambo, W.W. The effects of partial pairing on scale values derived
from the method of paired comparisons. Journal of Applied Psychology,
1959.


Finally, each stimulus ("task") was rated by more judges in
the earlier study than in the present study. To the extent that
increasing the number of judges per stimulus decreases systematic
bias in the ratings, inter-rater reliability would be expected to
increase with increases in the number of judges.


Validity 

The  conduct  of  this  or  any  other  study  that  purports  to  measure 
task  criticality  raises  questions  about  the  validity  of  the  results 
obtained,  namely: 

1. Construct validity: To what extent has what
has been purported to have been measured (that
is, task criticality) actually been measured?
Or, to what extent has inadvertent measurement
of constructs other than criticality
affected the results obtained?

2.  Content  validity:  To  what  extent  do  the  "items" 
(tasks)  used  in  the  questionnaires  represent 
the  universe  of  items  or  tasks? 

3.  Predictive  validity:  To  what  extent  would  the 
criticality  scores  or  predictions  made  from 
them,  correlate  with  a direct  measure  of 
criticality? 


Construct Validity. The instructions to the raters in the
present study were intended to create a set for judging criticality
and criticality alone. But the extent to which the subjects'
judgments were influenced by extraneous considerations such as learning
difficulty, performance difficulty, performance frequency, and the
like is unknown. Questions about construct validity will remain as
long as reasonable counterinterpretations of the results can be
advanced.^ Construct validity cannot therefore be established by
conducting a "one-shot" study. A plan for initiating examination of


^Cronbach, L.J. Test validation. In R.L. Thorndike (Ed.), Educational
Measurement (Second Edition). Washington, D.C.: American Council
on Education, 1976.
the construct validity of criticality as measured here is presented
in Appendix F. The plan is for a correlational study of validity,
based on the work of Campbell and Fiske.^ Factors that might be
expected to compete with or contaminate the criticality construct
are each measured by two dissimilar methods, as is criticality. The
underlying assumption is that measures of the same constructs by dissimilar
methods should converge, while measures of different constructs by the
same or different methods should diverge.
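
The convergent and divergent pattern described above can be illustrated with a small sketch. The plan in Appendix F is not reproduced in this section, so everything below is hypothetical: the two constructs, the two methods per construct, the six task scores, and the function names are all invented for illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def convergent_and_discriminant(measures):
    """Split correlations by whether the two measures share a construct.

    `measures` maps (construct, method) to scores over the same tasks.
    """
    convergent, discriminant = [], []
    for (key_a, a), (key_b, b) in combinations(measures.items(), 2):
        r = pearson(a, b)
        (convergent if key_a[0] == key_b[0] else discriminant).append(r)
    return mean(convergent), mean(discriminant)


# Invented scores for six tasks; labels are hypothetical.
measures = {
    ("criticality", "paired comparison"): [1, 2, 3, 4, 5, 6],
    ("criticality", "absolute rating"): [1.2, 1.9, 3.1, 4.2, 4.8, 6.1],
    ("learning difficulty", "method A"): [3, 1, 4, 1, 5, 2],
    ("learning difficulty", "method B"): [2.8, 1.3, 4.1, 0.9, 5.2, 2.1],
}
conv, disc = convergent_and_discriminant(measures)
print(round(conv, 2), round(disc, 2))
```

Evidence for construct validity in this framework would be same-construct correlations that are clearly higher than different-construct correlations; a full multitrait-multimethod analysis would also separate same-method from different-method pairs, which this sketch does not do.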

Content  Validity.  The  issue  of  how  well  the  content  of  the 
questionnaire sampled the universe of subject matter about which
conclusions were drawn can never be fully resolved. Resolution would
require  widespread  agreement  on  the  adequacy  of  the  parameters  or 
descriptors  used  to  define  the  universe,  and  on  precise  definition 
of  what  constitutes  adequate  sampling.  In  the  present  study,  the 
"universe"  was  defined  as  consisting  of  all  tasks  rated  critical  or 
important  in  earlier  studies  by  the  Army  and  its  contractors;  and 
tasks  were  sampled  from  the  universe  for  inclusion  in  the  questionnaires 
using  the  method  described  in  Appendix  D.  To  the  extent  that  other 
investigators  would  define  the  task  universe  differently  than  was  done 
here,  would  sample  tasks  differently,  or  both,  the  question  of  content 
validity  remains  open. 

As is the case for construct validity, investigation of content
validity is not a "one-shot" affair. A duplicate-construction
experiment^ would provide a rigorous test of content validity: Two
teams of equally competent questionnaire developers independently
would prepare the questionnaires using identical universe definitions
^Campbell, D.T. and Fiske, D.W. Convergent and discriminant validation
by the multitrait-multimethod matrix. Psychological Bulletin, 1959.

^Cronbach, L.J., op. cit., 1976.


and rules for selecting questionnaire items. If the universe and
sampling are adequately defined, the two forms of the questionnaire
will be equivalent. The results of an individual's taking both
forms should be identical (within the limits of sampling error).

"A  favorable  result,  on  a suitable  broad  sample 
of persons, would strongly suggest that the
test content is fully defined by the...construction
rules.... An unfavorable result would indicate
that the universe definition is too vague or too
incomplete to provide a content interpretation
for the test."^

A less rigorous examination of content validity might be made
using critical incidents gathered from veterans of armored combat.
Incidents could be gathered until, on the basis of increasing
redundancy or another criterion, one was satisfied that the universe
of incidents had been adequately sampled. An attempt would then
be made to match each task used in the questionnaires with at least
one incident. If incidents were identified for which there was no
matching task, a basis would be provided for questioning the content
validity of the questionnaires. (If, on the other hand, tasks were
identified for which there were no matching critical incidents, this
would indicate that the pool of critical incidents did not constitute
an adequate sample of the task universe.)


Predictive Validity. Establishing the predictive validity of
the results of the criticality study would require correlating the
obtained criticality scores with a direct measure of criticality.
Obtaining direct measures of task criticality in combat is, of course,
out of the question. "Direct" is, however, a relative term.
Intermediate criteria (combat simulations, for example) might be used
in studies of predictive validity. One suspects, though, that

^Cronbach, L.J., op. cit., 1976.
achieving adequate measurement reliability under simulated combat
conditions would be very expensive (though absolutely essential
if any important decisions are to be made based on the simulation
results). Until reliable intermediate criterion measures are
forthcoming, the door to establishing the predictive validity of
criticality ratings will remain closed.

The  more  general  question  of  how  well  indirect  measures  (ratings, 
for  example)  of  criticality  predict  more  direct  measures  may,  however, 
be  answerable.  Assume,  for  example,  that  one  could  create  a game 
with  a clearly  defined  goal,  and  with  clearly  defined  tasks  that  may 
be  performed  in  achieving  that  goal.  Assume  further  that,  by 
virtue  of  design,  the  relevance  or  criticality  of  each  task  is  known 
to  the  game's  creators.  People  could  be  taught  the  rudiments  of  the 
game,  given  practice  until  they  were  thoroughly  familiar  with  its 
play, and then asked to judge criticality of the various tasks in
play of the game. The correlation between task ratings and actual
criticality would offer evidence as to the quality of subjective
measures of task criticality typically made for real jobs. This hypothetical
game  could  also  provide  a setting  for  studying  the  quality  of  ratings 
as  a function  of  job  (game)  proficiency  and  rating  method. 

CONCLUSIONS 

1.  The  criticality  values  obtained  in  this  study  seem  to  make  sense  — 
more  so  for  the  high-rated  tasks  than  for  the  low-rated  tasks.  The 
study,  however,  dealt  only  with  tasks  that  had  been  rated  critical 
or  important  in  earlier  studies.  Because  this  was  so,  and  because 
the  present  study  generated  relative  criticality  ratings,  an 
unavoidable  outcome  was  that  some  tasks  judged  critical  in  earlier 
studies  were  judged  less  critical  in  the  present  one. 
2. The reliability of the criticality ratings is acceptable, if only
marginally so. The paired comparison technique holds promise,
and additional research would shed light on how to generate
criticality estimates that are highly reliable. Until such
research is forthcoming, some tentative operating assumptions
can be offered. Inter-rater reliability in studies of task
criticality can be expected to increase with:

A. Specificity of the dimensions along which
criticality ratings are to be made. This
probably is the sine qua non for high
rater agreement. To the extent that
investigators can create a uniform set among
raters as to the dimensions along which
judgments are to be made, rater agreement
should increase. Without clear specification
of the dimensions for making
judgments, raters will "make up" their own
dimensions. And if these dimensions differ
from one rater to the next, rater agreement
will suffer.

B. Common learning experiences among raters.
The obvious recommendation, that raters
should practice making judgments of the
kind required by the criticality study,
is warranted only when the condition
discussed in item A above is met; that is,
when the dimensions for making the
judgments are clearly specified. Practice might
otherwise simply reinforce idiosyncratic
rater behavior and thus reduce rater
agreement.

C.  The  extent  to  which  complete  pairings  of  the 
tasks  to  be  rated  is  approximated.  The 
desirability  of  eliminating  the  "luck  of  the 
draw"  in  determining  which  tasks  get  paired 
with  one  another  must,  however,  be  traded  off 
against  the  heavy  subject  workloads  that 
characterize  complete  pairings  with  large 
numbers  of  stimulus  materials. 
D. The number of times each stimulus is rated.
Every subject need not rate every possible
pair of tasks, though this may be desirable.
Decreasing the workload of each subject can
be accomplished in several ways. Partial
pairings can be used, with all subjects
rating all pairs. Or complete pairings can
be used with some of the subjects rating some
pairs and not others. Various mixes of the
approaches also may be used: partial pairings,
with some subjects rating some pairs and not
others. The optimal compromises are,
unfortunately, not known. It would be
interesting to examine the effects on rater
agreement of various reductions (combined and
in isolation) in number or proportion of
compared pairs, number or proportion of
subjects rating each pair, and number of
observations per stimulus and pair. The generality
of the results of such research would, of
course, never be fully established. Questions
would always remain about the effects of
stimulus materials, instructions to raters,
rater experience, and so forth, on the results
obtained. But if confidence is desired in the
results of studies that purport to measure the
criticality of combat tasks, then additional
research on factors affecting rater reliability
seems necessary.

The  paired  comparison  method,  in  any  event,  would  seem  to  yield 
reliability  estimates  that  are  higher  than  those  found  in  more 
conventional  ratings  of  task  criticality.  But  to  be  more 
certain,  controlled  studies  comparing  various  rating  methods 
are needed, especially since inter-rater reliability of
criticality ratings is not customarily reported in Army training
development  literature. 


3.  The  validity  of  the  task  criticality  ratings  remains  unknown. 
Construct,  content,  and  predictive  validity  present  separate 
issues  for  consideration: 

A. A plan for initiating investigations of
construct validity has been presented.
Implementing the plan would shed light
on the issue of the extent to which the
present study measured criticality, as
opposed to other constructs.
B. The issue of content validity never is fully
resolved. Suggestions were made, however,
for appropriate examinations.

C. No direct measures of the criticality of combat
tasks can be made, and intermediate
criteria (combat simulations, for example)
are likely to be unreliable. Until reliable
intermediate criterion measures are
forthcoming, the door to establishing predictive
validity will remain closed. An approach
was suggested, however, for addressing the
general question of how well indirect measures
of criticality predict more direct measures.

Concern with the validity of the ratings, though appropriate,
seems premature. Reliability issues associated with estimating
the criticality of armor tasks have only begun to be raised.
Given (a) that nothing is known about the validity of criticality
estimation, and (b) a choice between results of known and unknown
reliability, training developers would seem well advised to use
results whose reliability is known.


CLUSTER ANALYSIS


With  tasks  generated  and  organized  for  the  three  tank  systems,  and 
task  criticality  established  with  an  acceptable  degree  of  reliability, 
attention  was  turned  to  exploring  new  treatments  of  the  task  data.  An 
attempt would be made to identify relatively homogeneous families of
tasks,  and  to  use  the  families  as  a basis  for  designing  instructional 
modules  in  Task  2 of  the  project. 

Cluster analysis^ is a method for sorting or classifying objects,
concepts, tasks, or other "things" by measuring similarities among
patterns of descriptors. All objects or tasks to be sorted are first
described, binary-fashion (yes-no, present-absent), in terms of a common
set of descriptors. A simple example of the binary method of description
is shown in Figure 2, where three tanks have been characterized
according to a common set of descriptors. A cluster analysis of the
one-zero data in Figure 2 would sort the tanks by measuring the
similarities among the patterns of descriptors that characterize the tanks. The
M48A5 and the M60A1 would form a cluster, because their descriptor
patterns (1, 0, 0, 1) are identical. The M60A3 would form a separate cluster,
because its descriptor pattern (1, 1, 1, 1) is different from the patterns
for the M48A5 and the M60A1.^
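
The tank example can be worked through with a minimal sketch of stepwise amalgamation. The descriptor patterns are the ones given in the text; the distance measure (number of mismatched descriptors) and the single-linkage merging rule are assumptions made for illustration and are not necessarily those of the procedures cited in the footnotes. The function names are hypothetical.

```python
from itertools import combinations


def mismatches(a, b):
    """Number of descriptors on which two one-zero patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def amalgamate(patterns):
    """Merge the two closest clusters on each pass; return every pass."""
    clusters = [[name] for name in patterns]
    passes = [sorted(sorted(c) for c in clusters)]

    def distance(c1, c2):
        # Single linkage (assumed): distance between closest members.
        return min(mismatches(patterns[a], patterns[b])
                   for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        passes.append(sorted(sorted(c) for c in clusters))
    return passes


tanks = {
    "M48A5": (1, 0, 0, 1),
    "M60A1": (1, 0, 0, 1),
    "M60A3": (1, 1, 1, 1),
}
for number, clusters in enumerate(amalgamate(tanks), start=1):
    print(number, clusters)
```

The first pass treats each tank as its own cluster, the second joins the two tanks with identical patterns, and the last places all three in one cluster. Choosing which pass to use is a judgment that depends on the purpose of the analysis.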

^Hartigan, J.A., op. cit., 1972.

^Dixon, W.J., op. cit., 1975.

^The formation of clusters is not as automatic as described here. The
process is, in fact, amalgamative and comprised of successive "passes"
through the data. In the first pass, each described object forms a
cluster. Successive passes form fewer and fewer clusters, each
containing more and more of the described objects, until in the final pass, all
objects are included in a single cluster. Selecting passes and clusters
from the available ones requires devising and using guidelines or rules
which reflect the purpose of the analysis. This point is elaborated in
Appendix L.
Figure 2. Example of one-zero data of the
kind  used  in  cluster  analysis. 

Statistical  formulations  obviously  are  not  necessary  for  sorting 
such  disparate  objects  as  tanks.  Cluster  analysis  has,  however,  been 
used  to  study  such  diverse  topics  as  neighborhood  voting  preferences,^ 
psychosis  and  anxiety,^  and  tank  gunnery  job  objectives.^  Cluster 
analysis  was  selected  for  use  in  the  present  study  in  an  attempt  to 
identify  "families"  of  armor  tasks  that  had  many  descriptors  in  common. 

If  relatively  homogeneous  families  of  tasks  could  be  identified,  the 
families  could  be  treated  as  skills,  and  efficiency  might  be  achieved 
in training by designing instructional modules around the skills.

PURPOSE 

The main purpose of this part of the project was to examine the
utility of cluster analysis as a method for sorting armor tasks. As in
the criticality study, the issue of inter-rater reliability also arises:
given identical descriptors, tasks, and instructions, to what extent will
raters agree on their characterizations of the tasks? A secondary
purpose was therefore to examine the extent of correspondence between two
independently generated sets of one-zero task description data.
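
One way the correspondence between two raters' one-zero descriptions could be indexed is sketched below. This section does not say which index the study used, so both measures shown here (simple proportion of agreement and Cohen's kappa, which corrects for chance agreement) are assumptions, and the ratings are invented.

```python
def rater_agreement(rater_a, rater_b):
    """Return (proportion of agreement, Cohen's kappa) for one-zero data."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's own rate of assigning a 1.
    rate_a, rate_b = sum(rater_a) / n, sum(rater_b) / n
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical judgments by two raters on eight task-descriptor cells.
first_rater = [1, 0, 0, 1, 1, 0, 1, 0]
second_rater = [1, 0, 1, 1, 1, 0, 0, 0]
observed, kappa = rater_agreement(first_rater, second_rater)
print(observed, kappa)
```

Raw agreement can look high simply because most descriptors do not apply to most tasks, which is the reason for also reporting a chance-corrected index. Kappa is undefined when both raters give the same single value to every cell, and the sketch does not guard against that case.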


^Tryon, R.C. Identification of social areas by cluster analysis.
University of California, Publications in Psychology, 30, 1955.

^Tryon, R.C. Unrestricted cluster and factor analysis with applications
to the MMPI and Holzinger-Harman problems. Multivariate Behavioral
Research, 1, 1966.

^Boldovici, J.A., Wheaton, G.R., and Boycan, G.G., op. cit., 1976.
METHOD


The method for generating the required one-zero task description
data comprised two steps:

1.  Selecting  task  descriptors. 

2.  Characterizing  the  tasks. 


Selecting  Task  Descriptors 

Several criteria were used in selecting descriptors for
characterizing the tasks. The three main criteria were that:

1. Characterizing the tasks in terms of the descriptors
could be done with a reasonable degree of rater agreement.
This was seen as the minimal test of the replicability of
the procedures used here. The desire to meet the requirement
for reasonable inter-rater reliability in turn suggested
other criteria for selecting the descriptors; namely, that
the descriptors should be definable in ways that would be
readily and uniformly understood by the raters. Ideally,
the descriptors would be mutually exclusive, though this
was recognized at the outset to be a criterion that never
would be fully met.

2. Sorting the tasks in terms of similarities among
their descriptor patterns should yield differential
implications for training. Application of this criterion
led, as will be seen later, to considering existing
learning and task taxonomies as descriptors.

3. The descriptors should be comprehensive: All
tasks for the three tanks should be describable
in terms of the same set of descriptors. Comprehensiveness
may, of course, be achieved by the use of a single
non-discriminating descriptor for all tasks; "performed
by a tank crew member," for example. This consideration
led to a final loose criterion concerning number and kind
of descriptors, which was applied in conjunction with
the comprehensiveness criterion: The descriptors
were to be neither so numerous as to be unmanageable
nor so few as to mask important distinctions
among the tasks.
Consideration was given during early project planning to using the
job-task elements in the Position Analysis Questionnaire^ as task
descriptors. Any job or task, including the tank crew jobs and tasks
addressed in this project, almost certainly can be described using the
P.A.Q. elements. But cluster analysis based on tasks characterized by
the P.A.Q. descriptors would have no clear implications for training.
Attention was therefore directed toward finding a set of descriptors which
had training principles or learning algorithms associated with it. The
obvious candidates were the conditions and kinds of learning described
by Gagné,^ and by Gagné and Briggs;^ and the learning algorithms
presented in the Training Analysis and Evaluation Group's (TAEG) A Technique
for Choosing Cost-Effective Instructional Delivery Systems.^

Gagné's types of learning were not used. Even though learning
principles are presented for each, the eight types of learning are
hierarchically ordered, so that any given type may subsume other types that
are lower in the hierarchy. The types of learning therefore are not at
all mutually exclusive, and this was thought to invite poor
discrimination in the task characterizations that would be performed later.

The TAEG's twelve learning types seemed "less hierarchical" than
Gagné's, but here again unreliability in task ratings seemed to be
invited by the algorithms' not being mutually exclusive. Many tasks and
subtasks can be imagined, for example, that one rater would call "Rule
Learning and Using," that another rater would call "Making Decisions,"
and that yet another would call both. In reviewing the TAEG reports we
also noticed that the training guidelines associated with each of the
twelve kinds of learning were highly similar. Thus if the TAEG system
were used, one might end with no clear-cut implications for differentially
applying the guidelines to each kind of learning.^


^McCormick, E.J., Mecham, R.C., and Jeanneret, P.R. Position Analysis
Questionnaire (PAQ). West Lafayette, Indiana: PAQ Services, Inc.,
1972.

^Gagné, R.M., op. cit., 1965.

^Gagné, R.M., and Briggs, L.J. Principles of Instructional Design.
New York, New York: Holt, Rinehart and Winston, Inc., 1974.

^Braby, R., Henry, J.M., Parrish, W.F., Jr., and Swope, W.M. A Technique
for Choosing Cost-Effective Instructional Delivery Systems (TAEG
Report No. 16). Orlando, Florida: Department of the Navy, Training
Analysis and Evaluation Group, 1975.


Reviewing the systems discussed above prompted the thought that
using a set of descriptors comprising four subsets might produce
results that had differential implications for training:

1. A Stimuli subset, which would allow noting for
each task and subtask the cues that initiated
and maintained performance. Describing tasks
in terms of the stimulus subset would, it was
hoped, provide clues later for specifying or
selecting training and testing materials, and
for specifying display characteristics for
training devices.

2. A subset of Tools, Instruments and Controls,
which would allow noting for each task and
subtask the manipulanda or mediators of crew
members' performance. As with the stimulus
subset, it was hoped that describing tasks
in terms of the tools, instruments, and controls
would facilitate selecting training and
testing materials, and specifying training
device characteristics.

3. A Mediating Processes subset, which would allow
noting for each task and subtask the kinds of
learning involved in task performance. Most of
the TAEG learning classes could be used in this
subset, in the interest of providing a fall-back
position in the event that clustering tasks on
the basis of all four subsets of descriptors
would not yield obvious training implications.

4. An Overt Response subset, which would allow
noting, for each task and subtask, the motor
behavior involved in task performance. Describing
tasks in terms of the Overt Response
subset would, it was hoped, help in specifying
control characteristics of devices, and in
test development.


^This is by no means an indictment of the TAEG system. The best training
methods or principles for various kinds of learning may well be more
similar than different. And there is certainly no reason to believe
that types of learning should be or are mutually exclusive. The point
is simply that without mutual exclusivity, inter-rater reliability in
task classification probably will suffer.


As can be inferred from the foregoing discussion, the criterion of
mutual exclusivity (and therefore inter-judge agreement) was "traded off"
in the Mediating Processes subset against the apparent desirability of
using the TAEG descriptors, for which learning algorithms were readily
available. The four subsets of descriptors that were selected for use
in the study were an amalgam of the TAEG classes of learning, and several
stimulus, tool, test equipment, and response descriptors that were
included for the sake of definitional clarity, comprehensiveness, or both.
The four subsets of descriptors are listed across the top of Figure 3.
Definitions of the descriptors are attached as Appendix G.

Characterizing  the  Tasks 

Forms were printed which had the four subsets of task descriptors
across the top of the page, and tasks and subtasks down the left side.
Figure 3 is a part of one of the forms. Generating the task by
descriptor matrix began with selecting 18 of the 226 M60A1 tasks for use
in practicing the task characterizations or ratings. Two criteria were
used in selecting the 18 practice tasks:

1. Each duty position was represented in the
sample in approximately the same proportion
as the duty position is represented in the
population of M60A1 tasks.

2. The sample tasks represented the types of
tasks performed by each crew member. The
Driver was represented by maintenance and
driving tasks, for example, and the Gunner
by coax and main gun tasks.

Two members of the project staff independently rated the subtasks for
each of the 18 sample tasks. Working from left to right in the row
corresponding to each subtask (see Figure 3), each rater entered a "1" in the
columns corresponding to descriptors that characterized the subtask, and
left blank the descriptor columns that did not pertain to the subtask.


The  ratings  were  done  at  the  subtask  rather  than  the  task  level  in  the 
interest  of  inter-rater  reliability:  Assuming  that  greater  precision  is 
possible  in  defining  subtasks  than  in  defining  tasks,  one  would  expect 
the  reliability  of  the  ratings  to  be  greater  at  the  subtask  than  at  the 
task  level. 

The raters based their judgments on their knowledge of the conditions
under which the subtasks are normally performed, the behavior involved in
performing the subtasks, information from technical manuals for the
vehicles, and the definitions of the task descriptors shown in Appendix G.

On completing the practice ratings, the raters discussed points of
disagreement and made notes that increased the clarity and precision of
the definitions of the task descriptors. All tasks for each duty
position in each of the three tanks were then rated for record independently
by the two raters. Note that in performing this final round of ratings,
the judges re-rated the 18 tasks that they had rated earlier.

After all subtasks in a given task were rated, each descriptor
column was examined. If at least one "1" was noted in the column, then
a "1" was entered in the same descriptor column for the task. The one-zero
entries in the task rows of the two raters' data sheets were used to
examine inter-rater reliability. The two raters later reconciled any
differences between their data sheets, producing a uniform set of
one-zero data which were the input for the cluster analyses.
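
The subtask-to-task rule just described amounts to a column-wise logical OR
over a task's subtask rows. The following is a minimal sketch of that rule
only; the function name and the five-descriptor example are hypothetical and
are not taken from the project's data sheets.

```python
def task_row_from_subtasks(subtask_rows):
    """Collapse subtask one-zero rows into one task row (column-wise OR)."""
    if not subtask_rows:
        raise ValueError("a task needs at least one rated subtask")
    n_descriptors = len(subtask_rows[0])
    if any(len(row) != n_descriptors for row in subtask_rows):
        raise ValueError("all subtask rows must cover the same descriptors")
    # A descriptor is marked for the task if any subtask carries a 1 in it.
    return [1 if any(row[j] for row in subtask_rows) else 0
            for j in range(n_descriptors)]


# Hypothetical task with three subtasks rated on five descriptors.
subtasks = [
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
]
task_row = task_row_from_subtasks(subtasks)
```

Under this rule a descriptor that applies to a single subtask is carried up to
the task, so task rows can only be as sparse as their densest column allows.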

ANALYSES  AND  RESULTS 

Two kinds of analyses were done using the data generated by the two
raters:

1. Inter-rater reliability analyses, to determine:

A. The extent of agreement between the
two raters in characterizing the
tasks.

B. Whether the discussions between the
raters after rating the 18 practice
tasks improved agreement on their
ratings for record.

2. Cluster analyses, to identify skills, or clusters
of tasks with descriptor patterns that were dissimilar
among clusters and similar within clusters.


Inter-rater Reliability

The extent of agreement between the two raters was studied in two
stages. The first stage used the ratings of the 18 practice tasks
mentioned earlier. Recall that the 18 practice tasks were interspersed
among the 226 M60A1 tasks and were rated for record after the practice
session by the same two raters who did the practice ratings. Two sets of
ratings were therefore available for the 18 practice tasks: the practice
ratings, and the ratings for record that were done a month after the
practice ratings. Recall also that between the practice ratings and the
ratings for record the raters discussed points of disagreement and revised
the definitions of the task descriptors for increased precision and
clarity. A basis was thus provided for examining the effects of the raters'
discussion on inter-rater reliability.


The second stage of the inter-rater reliability study provided an
estimate of the final level of reliability achieved. After all tasks
were rated, 22 of the 208 M60A1 tasks that were not rated in the practice
session were selected using the same criteria as were used for selecting
the 18 practice tasks. The ratings for the 22-task sample were compared
with the second round of ratings for the 18-task sample, as a means
of verifying the level of inter-rater reliability attained in the final
round of ratings for the 18 practice tasks, and of checking on the
independence of the final ratings of the 18 practice tasks. The tasks
comprising the two samples are presented in Appendixes H and I.
Inter-rater reliability was estimated conservatively, using a method
that did not count a zero-zero match between raters as an agreement. Phi
coefficients (φ) were used in all cases as the index of inter-rater
reliability. Details of computation and discussions of the results are
presented in Appendix J.
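
For readers without Appendix J at hand, the sketch below shows the textbook
phi coefficient for two raters' one-zero entries, computed from the 2 x 2
agreement table. It is an illustration under stated assumptions, not the
report's computation: the standard phi uses all four cells, and the way the
study excluded zero-zero matches is described only in Appendix J. The second
function, an index that ignores the zero-zero cell altogether, is included
purely as an assumed example of a "conservative" agreement measure. All names
and data are hypothetical.

```python
import math


def agreement_table(rater_a, rater_b):
    """Count (both 1, A only, B only, both 0) over paired one-zero entries."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must rate the same cells")
    a = b = c = d = 0
    for x, y in zip(rater_a, rater_b):
        if x and y:
            a += 1
        elif x:
            b += 1
        elif y:
            c += 1
        else:
            d += 1
    return a, b, c, d


def phi(a, b, c, d):
    """Standard phi coefficient for a 2 x 2 table."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("phi is undefined when a margin is zero")
    return (a * d - b * c) / denom


def positive_agreement(a, b, c, d):
    """Assumed illustration: share of marked cells that both raters marked."""
    if a + b + c == 0:
        raise ValueError("no cell was marked by either rater")
    return a / (a + b + c)


# Hypothetical ratings of eight task-by-descriptor cells.
rater_a = [1, 1, 0, 0, 1, 0, 0, 0]
rater_b = [1, 0, 0, 0, 1, 1, 0, 0]
table = agreement_table(rater_a, rater_b)
phi_value = phi(*table)
overlap = positive_agreement(*table)
```

Because most cells in a task-by-descriptor matrix are blank for both raters,
an index that credits zero-zero matches tends to look more favorable than one
that does not, which is presumably why the conservative treatment was chosen.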

Inter-rater reliability for the 18 tasks rated before discussion
was .58, and after discussion .72. The increase was significant at the .05
level.^ Overall inter-rater reliabilities for all tasks rated after
practice were about .70. This is far in excess of chance expectancy, and
marginally acceptable in a practical sense. Suggestions for improving
inter-rater reliability in studies of this kind are presented in Appendix J.
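
The significance test is identified only by its footnote (a chi-square type
analysis of Fisher's z). The following is a hedged sketch of the usual
homogeneity test of that kind, not a reconstruction of the report's
computation. It assumes independent samples, and the sample size of 648
(18 tasks by 36 descriptors) is a guess made for illustration; the report
does not state the n that was used.

```python
import math


def homogeneity_chi_square(correlations, sample_sizes):
    """Chi-square statistic for equality of correlations via Fisher's z.

    Returns (statistic, degrees of freedom). Each z is weighted by n - 3.
    """
    if len(correlations) != len(sample_sizes) or len(correlations) < 2:
        raise ValueError("need two or more correlations with matching sizes")
    weights = [n - 3 for n in sample_sizes]
    if any(w <= 0 for w in weights):
        raise ValueError("each sample size must exceed 3")
    zs = [math.atanh(r) for r in correlations]  # Fisher's z transformation
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    stat = sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))
    return stat, len(correlations) - 1


CHI2_CRITICAL_05_DF1 = 3.841  # tabled .05 critical value for 1 df

# .58 and .72 are the reported reliabilities; 648 is an assumed n.
stat, df = homogeneity_chi_square([0.58, 0.72], [648, 648])
exceeds_critical = stat > CHI2_CRITICAL_05_DF1
```

With two correlations the statistic equals the square of the familiar z test
for the difference between two independent correlations.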


Task  Clusters 

The reconciled one-zero task by descriptor data were analyzed using
a canned cluster analysis program.^ The program uses the Direct
Clustering algorithm, which is discussed further in Appendix L.
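
To make concrete the idea of clustering one-zero task rows in successive
passes, the sketch below runs a generic agglomerative procedure with Jaccard
distance and average linkage. It is a stand-in chosen for brevity and is not
the Direct Clustering algorithm used by the canned program; the four task
rows are hypothetical.

```python
from itertools import combinations


def jaccard_distance(u, v):
    """1 minus (descriptors shared) / (descriptors marked in either row)."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def average_linkage(rows, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(jaccard_distance(rows[i], rows[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(rows):
    """Return the partition after each pass, ending with a single cluster."""
    clusters = [(i,) for i in range(len(rows))]
    passes = [list(clusters)]
    while len(clusters) > 1:
        # Merge the two clusters that are closest on average.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(rows, pair[0], pair[1]))
        merged = tuple(sorted(c1 + c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        passes.append(list(clusters))
    return passes


# Hypothetical one-zero rows for four tasks on four descriptors.
rows = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
passes = agglomerate(rows)
```

Each entry of `passes` is one candidate solution; choosing among them is the
step that requires rules reflecting the purpose of the analysis.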

Eight  cluster  analyses  were  performed: 

1.  Across  duty  positions,  M60A1. 

2.  Across  duty  positions,  M48A5. 

3.  Across  duty  positions,  M60A3. 

4.  Across  duty  positions,  across  tanks. 

5.  Driver,  across  tanks. 

6.  Loader,  across  tanks. 

7.  Gunner,  across  tanks. 

8.  Tank  Commander,  across  tanks. 


^The difference was evaluated statistically using a chi-square type
analysis of the transformed Fisher's z correlation (Hays, 1967, p. 532).

^Dixon, W.J., op. cit., 1975.
The results of the first four analyses were not particularly
instructive.^ The remaining four will be addressed here. The reason for
focusing on the last four of the analyses is threefold:

1. The alternative, analyzing the results by tank
across duty position, was not particularly
useful from a training-development point of
view, since training normally is done by duty
position.

2. Tasks that are more similar within than among
tanks should form unique clusters in the analyses
by duty position across tanks.

3. The analyses by duty position across tanks should
reveal areas and degrees of task similarity across
tanks.

The  clusters  or  "skills"  for  each  duty  position,  their  titles,^ 
and  the  tasks  comprising  each  are  shown  in  Appendix  B.  Eighty  skills 
were  identified  — 21  for  the  Driver,  19  for  the  Loader,  20  for  the 
Gunner,  and  20  for  the  Tank  Commander.  Notice  that  several  of  the 
skills  (Driver's  Clusters  2,  5,  8,  9,  and  21,  for  example)  are  one- 
or  two-tank  clusters.  This  suggests  that  unique  skills  were  not  masked 
by  the  across-tank,  by  duty-position  cluster  solutions. 

The cluster titles and the descriptor patterns that characterized
each skill are shown by duty position in Figures 4, 5, 6, and 7. In
each figure, "X" indicates that the descriptor appeared in more than 50
percent of a cluster's tasks, and "/" indicates that the descriptor
appeared in 30 to 50 percent of a cluster's tasks. An asterisk after a
cluster title indicates that the cluster is comprised of tasks that are
functionally dissimilar. Lubricate Machineguns (Loader's Cluster 12),
for example, contains the task, "Install Main Gun Breechblock" (see
Appendix B). The occasional quirks in cluster composition probably came
about because some of the descriptors were not sufficiently "fine-grained"
to permit discrimination among some functionally dissimilar tasks; that is,

^Presented under separate cover to the ARI/Fort Knox Field Unit
Chief.

^How cluster titles were derived is discussed in Appendix K.


[Figure 4 matrix not legible in this copy: Driver clusters (rows) by
descriptors grouped as Stimuli; Tools, Instruments, and Controls;
Mediating Processes; and Overt Responses (columns).]

Figure 4. Descriptor patterns for Driver clusters.


[Figure 5 matrix not legible in this copy: Loader clusters and number of
tasks in each cluster (rows) by descriptors (columns).]

Figure 5. Descriptor patterns for Loader clusters.
Figure 7. Descriptor patterns for Tank Commander clusters.


some descriptors (natural and environmental features, for example) were
so broad that tasks that were quite dissimilar operationally could have
had identical or very similar descriptor patterns. The fact that this
happened as seldom as it did is encouraging: the tasks comprising each
cluster do, on the whole, seem to "go together" operationally or
functionally.
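
The "X" and "/" convention used in the descriptor-pattern figures can be
expressed as a simple rule on how often a descriptor occurs within a cluster.
The sketch below is illustrative only; the function name and the four-task
cluster are hypothetical, and the boundaries follow the wording above (more
than 50 percent for "X", 30 to 50 percent for "/").

```python
def descriptor_marks(cluster_rows):
    """Return 'X', '/', or '' per descriptor for one cluster's task rows."""
    n_tasks = len(cluster_rows)
    if n_tasks == 0:
        raise ValueError("a cluster needs at least one task")
    marks = []
    for j in range(len(cluster_rows[0])):
        count = sum(row[j] for row in cluster_rows)
        # Integer comparisons avoid floating-point trouble at the boundaries.
        if 2 * count > n_tasks:           # more than 50 percent of tasks
            marks.append("X")
        elif 10 * count >= 3 * n_tasks:   # 30 to 50 percent of tasks
            marks.append("/")
        else:
            marks.append("")
    return marks


# Hypothetical cluster of four tasks described on three descriptors.
cluster = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
]
marks = descriptor_marks(cluster)
```

Here the first descriptor occurs in three of four tasks and earns an "X", the
second occurs in exactly half and earns a "/", and the third falls below the
30 percent floor and is left blank.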


Narrative descriptions of a sample of the skills and a few
representative tasks are shown in Figures 8, 9, 10, and 11. How the
narratives were formed is discussed in Appendix L.


The results of the cluster analysis revealed some task clusters that
were unique to a particular vehicle, and yielded cluster profiles that
enable comparisons among skills for the different duty positions. More
generally the results suggested that, in terms of the descriptors used,
there tends to be greater similarity across vehicles in tasks performed
than there is between functional categories of tasks within a vehicle.
In other words, tasks representing similar tank operations tended to
cluster together regardless of which tank they are performed on.


One can, in retrospect, think of several ways that the descriptors
could be changed for more desirable cluster definitions. Task complexity
or difficulty is not reflected in the descriptors as well as it could
have been; for example, the stimulus descriptor "man-made environmental
features," would be checked in one instance for a white panel boresight
target, and in another instance for an obscured tank target to be
identified and fired on with the main gun. Or a "variable control" could in
one case refer to a dial to be set, and in another case to the Gunner's
tracking control handle.


Some of the characteristics that separated the clusters probably are
not as important as others for training development purposes; on-off
controls versus fixed setting controls, for example. And one can think of
some descriptors that probably should have been added; for example, a
descriptor or descriptors that separated reactive or highly time-constrained
tasks from those that are not. But selecting the "best" set of descriptors
DRIVER CLUSTER 1: INSTALL AND REMOVE EQUIPMENT

Performs fixed procedure hand-arm manipulation of on-off or open-close
controls and sometimes common hand tools in voluntary
response to scheduled operations.

Sample Tasks:

. Install the M27 periscope.

. Remove the VVS2 Driver's viewer.

DRIVER CLUSTER 16: DRIVE TACTICALLY

Performs continuous steering and multilimb manipulation of variable
controls in voluntary response to oral commands and environmental
features by recalling facts, making decisions, and classifying
information.

Sample Tasks:

. Perform evasive maneuvers upon enemy contact.

. Move vehicle into defilade firing position upon enemy contact.

Figure 8. Sample Driver clusters, narrative
descriptions, and representative tasks.


LOADER CLUSTER 7: PERFORM MISFIRE/IMMEDIATE ACTION PROCEDURES

Performs fixed procedure finger-hand-arm manipulation of special
tools and on-off and fixed setting controls in response to oral
command and sometimes touch by detecting information.

Sample Tasks:

. Apply immediate action to reduce a stoppage of the M219
machinegun.

. Unload misfired main gun round.

LOADER CLUSTER 15: PERFORM MAINTENANCE CHECKS AND SERVICES

Performs fixed procedure hand-arm manipulation of common tools in
response sometimes to either oral command or written technical
guidance and touch by detecting and sometimes recalling
information. Reports orally.

Sample Tasks:

. Perform at-halt checks on engine and transmission oil levels.

. Perform after-operations checks on final drives.

Figure 9. Sample Loader clusters, narrative
descriptions, and representative tasks.


GUNNER CLUSTER 1: ENGAGE TARGETS

Performs continuous, sometimes compensatory, and fixed procedure
finger-hand-arm manipulation of various controls in response to an
oral command and to man-made environmental features by detecting,
recalling, and classifying information while communicating orally.

Sample Tasks:

. Gunner fires main gun battlesight engagement using the GPD
(stationary/moving).

. Gunner fires main gun precision engagement using the TEL
(stationary/moving).

GUNNER CLUSTER 7: CONDUCT FIRE CONTROL INSTRUMENT CHECKOUT

Performs fixed procedure hand-arm manipulation of various controls
in voluntary response to instrument readouts and sometimes to touch
by detecting, recalling, and classifying information; sometimes
reports orally.

Sample Tasks:

. Place ballistic computer into operation.

. Perform Laser Rangefinder (LRF) malfunction detection test.


Figure 10. Sample Gunner clusters, narrative
descriptions, and representative tasks.


TANK COMMANDER CLUSTER 6: PERFORM TACTICAL GUNNERY PROCEDURES

Communicates orally and performs continuous steering and fixed
procedure finger-hand-arm manipulation of on-off or open-close controls,
variable setting controls, and sometimes fixed setting controls in
voluntary response to man-made environmental features and instrument
read-outs, by recalling facts, making decisions, detecting, and
classifying information.

Sample Tasks:

. TC fires main gun battlesight engagement using the RFD
(stationary/stationary).

. TC fires caliber .50 engagement using the TPI
(stationary/moving).

TANK COMMANDER CLUSTER 19: INSTALL AND MAINTAIN OPTICAL EQUIPMENT

Performs hand-arm manipulation of on-off controls or variable setting
controls in voluntary response to scheduled operations, written
technical guidance, instrument read-outs, or natural environmental
features by detecting information and sometimes recalling set
procedures.

Sample Tasks:

. Install periscope M36E1 head assembly.

. Perform after-operations maintenance checks and services on
periscope M36E1.


Figure 11. Sample Tank Commander clusters, narrative
descriptions, and representative tasks.


on  an  a priori  basis  probably  is  not  possible.  The  test  of  the  adequacy 
of  the  cluster  solution  used  here  will  be  in  the  utility  of  the  results 
for  designing  training  in  Task  2. 

CONCLUSIONS 


1. The results of inter-rater reliability studies with two judges
characterizing armor tasks in terms of 36 descriptors indicated
that:

A. Inter-rater reliability increased significantly with
practice and discussion, irrespective of whether the
tasks rated for record were the same as or different
from the tasks rated for practice.

B. Overall inter-rater reliabilities for the tasks
rated after practice were about .70.

2. Increases in inter-rater reliability greater than those obtained in
the present studies probably could have been achieved with:

A. Increased precision and clarity of the descriptor
definitions.

B. More practice.

C. More access to operational equipment, as a means of
verifying information obtained from technical
manuals and experts.

3. Cluster analysis was, with few exceptions, effective in sorting tasks
according to common mission operations. Occasional peculiarities
in cluster composition occurred, probably because some of the
descriptors were not sufficiently "fine-grained" to permit discrimination
among some dissimilar tasks. Increased cluster homogeneity might
be achieved with the addition of some descriptors that reflect task
difficulty or complexity, and others that would separate reactive
or highly time-constrained tasks from those that are not.

4.  The  utility  of  cluster  analysis  for  training  design  has  only  begun  to 
be  explored.  Several  iterations  of  the  kinds  of  analyses  reported  here 
will  be  required  before  the  most  useful  set  of  task  descriptors  for 
training  development  is  found.  Additional  data  treatments  also  should 
be  explored.  Cluster  analyses  based  only  on  stimulus  descriptors, 
for  example,  might  yield  more  obvious  implications  for  media  and 
device  selection  than  will  the  results  reported  here. 
SKILL CRITICALITY, LEARNING DIFFICULTY,
AND EVALUATION DIFFICULTY

The final part of exploring new treatments of task data was an
attempt to determine the criticality, learning difficulty, and
evaluation difficulty of each of the task clusters or skills
identified earlier.


SKILL  CRITICALITY 

The criticality of each task cluster was computed as the mean criticality
for the tasks in the cluster. The summary values for each cluster are
shown in Tables 4 through 7, and in Appendix B.

Though informative in a descriptive sense, cluster criticality seems not
particularly useful from the standpoint of training development.
Criticality is useful chiefly in establishing training priorities; and to
the extent that training programs are geared ultimately to tasks, it is
task criticality that matters. The integrity of a cluster, in terms of its
behavioral characteristics, would not be materially altered by omitting
one or two tasks, but its average criticality could be. Having obtained
the values by task, however, enables one to calculate the criticality of
any configuration of tasks that might comprise a training module.
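
The point about recomputing criticality for any configuration of tasks can be sketched in a few lines of Python. The task codes and criticality values below are hypothetical and are not taken from the report's tables; only the use of the mean follows the text.

```python
from statistics import mean

# Hypothetical task criticality values on the report's standard scale
# (mean 5.0, SD 1.0). These numbers are illustrative only.
task_criticality = {
    "A5101": 5.8,
    "A5102": 4.1,
    "A5103": 6.3,
    "A5104": 4.9,
}

def module_criticality(task_ids, criticality):
    """Mean criticality of any set of tasks (a cluster or a training module)."""
    return mean(criticality[t] for t in task_ids)

full_cluster = ["A5101", "A5102", "A5103", "A5104"]
trimmed_cluster = ["A5101", "A5103"]  # same cluster with two tasks omitted

full_value = module_criticality(full_cluster, task_criticality)
trimmed_value = module_criticality(trimmed_cluster, task_criticality)
print(round(full_value, 3), round(trimmed_value, 3))
```

Omitting two tasks leaves the kind of behavior in the cluster unchanged but shifts the average, which is why the task-level values are the more useful quantity.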

LEARNING  AND  EVALUATION  DIFFICULTY 

Learning difficulty and evaluation difficulty for the domain of tank crew
behavior associated with each descriptor were rated by five members of the
project staff. The estimates for each descriptor were averaged across
raters. Difficulty estimates for each skill or cluster were then made by
adding the descriptor scores for the modal descriptor pattern for each
task cluster. The sums were converted to standardized scales for learning
and evaluation difficulty, each with a mean of 5.0 and standard deviation
of 1.0, the same standard scale as was used for the criticality ratings.
Additional details of the methodology for estimating learning and
evaluation difficulty are presented in Appendix M.
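
A minimal Python sketch of the computation described above follows. The descriptor names, ratings, and modal patterns are hypothetical, and the use of the population standard deviation in the rescaling step is an assumption; the report does not say which form was used.

```python
from statistics import mean, pstdev

# Hypothetical ratings: five raters per descriptor (illustrative values).
ratings = {
    "d1": [2, 3, 2, 1, 2],
    "d2": [4, 3, 4, 3, 4],
    "d3": [1, 1, 2, 1, 1],
    "d4": [4, 5, 4, 4, 3],
}

# Step 1: average each descriptor's ratings across raters.
descriptor_difficulty = {d: mean(r) for d, r in ratings.items()}

# Step 2: sum the descriptor scores in each cluster's modal descriptor pattern.
modal_pattern = {
    "cluster_1": ["d1", "d2"],
    "cluster_2": ["d3"],
    "cluster_3": ["d1", "d2", "d4"],
}
raw = {c: sum(descriptor_difficulty[d] for d in ds)
       for c, ds in modal_pattern.items()}

# Step 3: rescale the sums to mean 5.0 and standard deviation 1.0.
def standardize(values, target_mean=5.0, target_sd=1.0):
    m = mean(values.values())
    s = pstdev(values.values())  # population SD: an assumption
    return {k: target_mean + target_sd * (v - m) / s
            for k, v in values.items()}

scaled = standardize(raw)
```

Because the raw score is a sum, a cluster characterized by more descriptors scores higher regardless of how difficult each descriptor is.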




Table 4

SKILL CRITICALITY, LEARNING DIFFICULTY, AND EVALUATION DIFFICULTY: DRIVER

[Table body illegible in the source scan.]


Table  5 

SKILL  CRITICALITY,  LEARNING  DIFFICULTY,  AND  EVALUATION  DIFFICULTY:  LOADER 


Table 6

SKILL CRITICALITY, LEARNING DIFFICULTY, AND EVALUATION DIFFICULTY: GUNNER
RESULTS AND DISCUSSION


The learning and evaluation difficulty estimates for each skill are
presented in Tables 4 through 7. Inter-rater reliability was estimated by
an analysis of variance of the rater by descriptor data matrix.^
Intraclass correlations were .76 for learning difficulty and .88 for
evaluation difficulty, indicating fairly high reliability of the average
of the five sets of ratings. (Each coefficient indicates the hypothetical
correlation that would obtain between the average ratings for this set of
five raters and those from another random sample of five raters.) If it is
assumed, however, that the raters differed systematically in their frames
of reference for judging the descriptors, then the reported correlations
are underestimates of inter-rater reliability. When the data are corrected
for differences among rater means, reliabilities of the mean ratings are
.83 for learning difficulty and .89 for evaluation difficulty.
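
The two kinds of coefficient can be illustrated with a short Python sketch. It assumes the analysis-of-variance estimates commonly attributed to Winer (1962) for the reliability of a mean of k ratings: the unadjusted form compares the between-descriptor mean square with the within-descriptor mean square, and the adjusted form removes differences among rater means before the comparison. The report does not reproduce its formulas, so this is an assumed reading, and the data are invented.

```python
def mean_rating_reliability(ratings):
    """ratings[i][j] is the rating of descriptor i by rater j.

    Returns (unadjusted, adjusted) reliability of the mean rating.
    """
    n = len(ratings)       # number of descriptors (rows)
    k = len(ratings[0])    # number of raters (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # descriptors
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # rater means
    ss_within = ss_total - ss_rows
    ss_resid = ss_within - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    ms_resid = ss_resid / ((n - 1) * (k - 1))

    unadjusted = (ms_rows - ms_within) / ms_rows
    adjusted = (ms_rows - ms_resid) / ms_rows
    return unadjusted, adjusted

# Two hypothetical raters who agree perfectly on ordering and spacing
# but differ by a constant offset of two scale points.
example = [[1, 3], [2, 4], [3, 5]]
unadjusted, adjusted = mean_rating_reliability(example)
```

In this invented case the only disagreement is a constant difference in rater means, so the adjusted coefficient reaches 1.0 while the unadjusted one falls to 0, the same direction of change as the report's move from .76 to .83.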

Averages of the learning and evaluation difficulty scale values were
computed across the skills in each duty position. These means, presented
in Figure 11, indicate that the skills required for the Tank Commander's
position are the most difficult for learning and for evaluation, followed
by the Gunner, Driver, and Loader on both dimensions. These findings
supported the expectations of the relative learning and evaluation
difficulties of skills among the four duty positions. Figure 12 also
presents tasks representative of those skills which received the highest
and lowest difficulty scores in each duty position. The same skills
appeared at the extremes of both dimensions in each of the four duty
positions.

^ Winer, B.J. Statistical Principles in Experimental Design. New York,
New York: McGraw-Hill, 1962.

Figure 12. Representative skills and tasks at the extremes in learning
and evaluation difficulty.

The results of the learning and evaluation difficulty study seemed in some
cases to be at odds with reality. Driver's Cluster 20, "Start tank
engine," for example, received an evaluation difficulty rating that was
more than two standard deviations above the mean. Such apparent
aberrations probably occurred for either or both of two reasons. The first
is that the method for computing cluster difficulty was additive. (Recall
that difficulty was computed by summing the difficulty values for
descriptors that predominated in each cluster.) The sum of the values
rather than the mean was used, on the assumption that the greater the
number of descriptors required to characterize the cluster, the greater
the cluster's complexity, and therefore the greater its difficulty of
evaluation and learning. This assumption may have been erroneous.

Another possible reason for the apparent aberrations is simply that some
of the cluster names do not describe the tasks comprising the cluster very
well. This is especially true for the asterisked clusters, which were
comprised of tasks related to more than one mission operation, but which
were named in terms of only one mission operation. The aberrant Driver's
Cluster 20 mentioned above is, in fact, one of the asterisked clusters. It
is comprised not only of tasks related to starting the engine, but also of
operating a tank across a water obstacle, driving over varied terrain, and
performing main gun prepare-to-fire procedures, tasks that may indeed be
extremely difficult to evaluate. Time and other resources unfortunately
did not permit exploring other ways of computing cluster difficulty that
might have produced results different from those obtained. Summing the
descriptor difficulty values for each task, for example, and then
averaging the task values within each cluster would be interesting.
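
The alternative treatment suggested in the last sentence can be sketched as follows. All names and values are hypothetical; the sketch shows only the order of operations (sum within task, then average within cluster) and does not reproduce any computation from the report.

```python
from statistics import mean

# Hypothetical mean difficulty rating for each descriptor.
descriptor_difficulty = {"d1": 2.0, "d2": 3.5, "d3": 1.0}

# Hypothetical descriptors that characterize each task in one cluster.
task_descriptors = {
    "task_a": ["d1", "d2"],
    "task_b": ["d1"],
    "task_c": ["d1", "d2", "d3"],
}

def task_difficulty(task):
    """Sum of descriptor difficulty values for a single task."""
    return sum(descriptor_difficulty[d] for d in task_descriptors[task])

def cluster_difficulty(tasks):
    """Average of the task-level sums within a cluster."""
    return mean(task_difficulty(t) for t in tasks)

value = cluster_difficulty(["task_a", "task_b", "task_c"])
```

Averaging at the cluster level keeps a heterogeneous cluster from scoring high merely because its tasks, taken together, touch many descriptors.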

As was the case with the criticality ratings, a question can be raised
about the extent to which learning difficulty and evaluation difficulty
were rated independently of other constructs (criticality, for example).
The extent to which learning difficulty and evaluation difficulty are
independent of one another also may be of interest. These are, of course,
questions of construct validity and could be examined using a plan
analogous to the one presented for the criticality ratings (see
Appendix F). Construct validity also can be examined, albeit tentatively,
by correlating some scores from the present study. The learning and
evaluation difficulty estimates for the 32 descriptors were highly
correlated (r = .76). This may indicate that skills that are difficult to
learn also are difficult to evaluate. But the learning and evaluation
difficulty values were generated on the basis of scores from the same
group of raters. The high correlation may, therefore, be a measurement
artifact: The two constructs may have been related in the judgment of the
raters, but not in fact.
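
For readers who want to repeat this kind of check on their own ratings, a product-moment correlation can be computed as in the sketch below. The two score lists are invented; they are not the report's 32 descriptor estimates.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical learning and evaluation difficulty estimates.
learning = [4.2, 5.1, 6.3, 4.8, 5.6]
evaluation = [4.0, 5.4, 6.0, 5.2, 5.1]
r = pearson_r(learning, evaluation)
```

A high coefficient from a single group of raters cannot by itself separate a real relationship from shared rater bias, which is the caution raised in the text.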


Other correlations bearing on the issue of construct validity are shown in
Table 8. The correlations between learning difficulty and criticality, and
between evaluation difficulty and criticality, averaged .44. As was the
case for the correlation between learning and evaluation difficulty, the
correlations may reflect a "real" relationship, or systematic bias in the
ratings (or both). The criticality estimates and the difficulty estimates
were, however, (a) generated from ratings by two independent sets of
judges (Captains and project staff members), and (b) measured differently
from one another. This suggests that the constructs are related in fact
rather than only in the judgment of the raters. Why criticality and
difficulty would be related is not clear. Designers of tank systems may,
because of space, hardware, or money limitations, allocate the most
critical system functions (detecting and tracking targets, for example) to
men rather than machines, and these critical functions may indeed be the
most difficult to learn and evaluate.


CONCLUSIONS 


1.  The cluster criticality estimates, which were averages of the
criticality values for the tasks comprising each cluster, probably will
not be as useful in training design as the criticality values for
individual tasks will be.

2.  The estimates of learning and evaluation difficulty were highly
reliable in terms of the stability of the mean ratings obtained.

3.  The results of the learning and evaluation difficulty studies were
inconclusive. Some of the results seemed at odds with reality. This may
have been because of deficiencies in methods for computing difficulty,
because some of the clusters were named inappropriately, or both. The
results reported here can be verified via additional treatments of the
obtained data (computing difficulty values for each task, and averaging
task values within each cluster, for example), or by conducting additional
research (paired comparison studies of task difficulty, for example).

4.  The estimates of learning difficulty and evaluation difficulty were
highly correlated. Skills that are difficult to learn may tend to be
difficult to evaluate also. The possibility of measurement error remains,
however, and may be examined using designs similar to the one presented in
Appendix F.

5.  The estimates of learning difficulty and evaluation difficulty each
correlated on an average of .44 with the criticality estimates. The
suggestion was offered that criticality and difficulty may in fact be
related because of system design practices that assign more critical and
difficult system functions to men rather than to machines.
APPENDIX A

METHOD FOR GENERATING THE TASK LISTS


M60A1 TASK LIST

Three data sources were used in generating the M60A1 task and subtask list
(see Table 1, p. 7). The main data source for the M60A1 list was a set of
job task data cards for the critical and important communications,
machinegun, and tracked vehicle tasks, as indicated in the 11E task list,
and supplied by the Job and Task Analysis Branch, Directorate of Training
Developments, U.S. Army Armor School, Fort Knox, Kentucky (1976). Task
data and criticality ratings from the Armor School were supplemented by
task data and criticality ratings from a second source, Performance
Measures for AIT Armor Crewmen.^

Gunnery tasks for the M60A1 list were obtained from a third source.
Boldovici, Wheaton, and Boycan^ attempted to define all tasks encompassed
by M60A1(AOS) gunnery.^ Since the task lists in that study seemed more
comprehensive than any available others, they were used to sample gunnery
tasks for use in the present project. Two criteria were used for selecting
the gunnery tasks: comprehensiveness and representativeness.

Comprehensiveness refers to the extent to which the gunnery tasks as a
group cover the gunnery domain, as represented in Table A.1.
Representativeness refers to the extent to which a task in each cell of
the domain subsumes elements or subtasks of other tasks in the same cell.


^Ford, J.P., Harris, J.H., and Rondiac, P.F. Performance Measures for
AIT Armor Crewmen. Fort Knox, Kentucky: Human Resources Research
Organization (HumRRO), 1974.

^Boldovici, J.A., Wheaton, G.R., and Boycan, G.G., op. cit., 1976.

^This study updated an earlier attempt at domain definition by Kraemer,
Boldovici, and Boycan (1975).


Table A.1


LOCATIONS IN THE GUNNERY DOMAIN OF TASKS
USED IN THIS PROJECT
(Each "X" represents one task.)


Battlesight (non-precision for machineguns)  X XX  X X X


Precision  X XX  X X

I XX 

Range  Card 

Range  card  Lay  to 
Direct  Fire 


Preliminary results from the Boldovici, Wheaton, and Boycan^ study
identified those gunnery tasks that were most comprehensive and
representative of the M60A1(AOS) gunnery domain. Their locations in the
domain are shown in Table A.1. The 17 gunnery tasks were modified to
incorporate a stationary firing vehicle, and became part of the M60A1 task
list for the present project.^

M48A5 TASK LIST

Generating the M48A5 list began with a review of the M60A1 list. All tasks
that were rated critical or important for the M60A1 in the sources
described earlier, and that would be performed by M48A5 crew members, were
considered also to be critical or important for the M48A5 and were
included in the M48A5 list. The M60A1-based list for the M48A5 was
expanded in two ways:

1.  The M48A5 Operator's Manual was reviewed. Whenever a task was found
that was performed by an M48A5 crew member, but not by an M60A1 crew
member, we made a judgment about the criticality or importance of the
task. If it was judged critical or important, the task was added to the
M48A5 list.

2.  The gunnery tasks that were included in the M48A5 list were the same
as the gunnery tasks for the M60A1. They were the set of tasks, modified
to incorporate target engagements from a stationary firing vehicle, which
according to the Boldovici, Wheaton, and Boycan report were the most
comprehensive and representative tasks in the M60A1(AOS) gunnery domain.


The M48A5 task list included 22 more tasks than the M60A1 list did. These
were tasks which the project staff judged important or critical, but which
were not in the 11E most-critical and important lists supplied by the
Armor School. Examples of the added tasks included "Check track tension,"
"Connect track," and "Zero M2 machinegun."

^Boldovici, J.A., Wheaton, G.R., and Boycan, G.G., op. cit., 1976.

^The M60A1 task and subtask lists have been presented under separate
cover. (See Harris, J.H., O'Brien, R.E., Campbell, R.C., and
Ford, J.P., 1976.)


M60A3 TASK LIST


The M60A3 will be the production version of the experimental M60A1E3.
Because of uncertainty about which product improvements will be
incorporated into the M60A3, some guesswork was required in generating the
task list for this tank.

As with the M48A5, the task list for the M60A1 was used as a starting
point for generating the list for the M60A3. Any M60A1 task that was also
performed by an M60A3 crew member, and was rated critical or important for
the M60A1, was included in the M60A3 list. Gunnery tasks were the ones
designated most comprehensive and representative in the study by
Boldovici, Wheaton, and Boycan.^ And the M60A1E3 Operator's Manual was
reviewed to identify tasks which seemed critical or important to the
project staff, but had not appeared in the 11E task list.


Best guesses had to be made, based on interviews with authorities at Fort
Knox, and on reviews of product improvement literature, about the final
configuration of the M60A3. Task lists were then written for the operation
and maintenance of those components that seemed most likely to be
incorporated into the production M60A3.


The M60A3 task list that evolved was different in several ways from the
M60A1 task list:

1.  The M60A3 gunnery tasks included precision engagements from moving
tanks with no requirement to come to a brief halt before firing.

2.  Tasks were written to reflect the following new components, which are
likely to replace existing ones or are new to the tank inventory.

A.  Laser Rangefinder, AN/VVG-2 (new component).

B.  Electronic Computer, XM21 (new component).

C.  Light Amplification Sights, M35E1, M36E1 (new component for Tank
Commander, replaces existing periscope for Gunner).

D.  Tank Thermal Sight (new component).

E.  Smoke Grenade Launcher (new component).

F.  Muzzle Reference System (new component).

G.  MAG-58 Coax Machinegun (replaces M219 machinegun).

H.  Driver's Viewer, VVS-2 (replaces Driver's Viewer, M27).


^Boldovici, J.A., Wheaton, G.R., and Boycan, G.G., op. cit., 1976.


EXPLANATION OF TASK CODE NUMBERS


Each task was identified by a five-place alpha-numeric code. The first two
of the five places identify tasks whose performance is common or unique to
the tanks, as shown in the following table:


TANK  SYSTEMS 

Designators 

M60A1 

M60A1(AOS)*  M48A5

M60A3 

AA 

X 

X X 

X 

AB  . 

X 

X X 

AD 

X 

X 

X 

AF 

X 

X 

AL 

X 

X 

AO 

X 

X 

A1 

X 

AS 

X ' 

A3 

X 

A5 

X 

AK 

X(NEW) 

Task numbers beginning with AA indicate tasks whose performance is common
to all four tanks; those beginning with A1 are unique to the M60A1, and
so forth.

The third place in the code is a numeral indicating duty positions as
follows: 1 = Driver, 2 = Loader, 3 = Gunner, 4 = Tank Commander.

The numbers in the last two places simply distinguish tasks within the
various tank/duty position categories; A5103, for example, is task number
3 in the M48A5 Driver set.

*Task lists for the M60A1(AOS), though not contractually required, were
prepared because doing so required little effort. They were not used in
subsequent analyses.
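
The coding scheme lends itself to a small decoder. The Python sketch below is illustrative only: the function name and output layout are hypothetical, and the designator lookup contains just the three designators whose meaning the text states or illustrates (AA, A1, and A5). The remaining designators would have to be filled in from the designator table.

```python
# Duty positions, as given for the third place of the code.
POSITIONS = {"1": "Driver", "2": "Loader", "3": "Gunner", "4": "Tank Commander"}

# Only the designators explained in the text; others are left unresolved.
DESIGNATORS = {
    "AA": "all four tanks",
    "A1": "M60A1",
    "A5": "M48A5",
}

def parse_task_code(code):
    """Split a five-place task code into designator, position, and number."""
    if len(code) != 5:
        raise ValueError("task codes have exactly five places")
    designator, position, number = code[:2], code[2], code[3:]
    if position not in POSITIONS or not number.isdigit():
        raise ValueError("unrecognized task code: " + code)
    return {
        "designator": designator,
        "applies_to": DESIGNATORS.get(designator, "see designator table"),
        "position": POSITIONS[position],
        "number": int(number),
    }

decoded = parse_task_code("A5103")
```

For the report's own example, A5103 decodes to task number 3 in the M48A5 Driver set.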
APPENDIX D

METHOD FOR PAIRING TASKS IN THE
PARTIAL PAIRED COMPARISON QUESTIONNAIRES

The method followed for pairing tasks had three steps:

(1)  Decide how many times to pair each task. This decision is governed
by the amount of time respondents can devote to the study. The rule for
this study was: If the task list has an even number of tasks, pair each
task seven times; if the task list has an odd number of tasks, pair each
task six times.

(2)  Calculate the total number of pairs desired. The formula for this
calculation is:

     (Tasks on list x Pairings per task) / 2 = Number of pairs desired

(3)  Select random tasks for pairing. This step requires a four-part
procedure:

     . Determine an interval by dividing the number of tasks by the
       desired number of pairings for each task.

     . Select the first starting point (or points) for counting. If the
       number of tasks is even, start at the approximate midpoint of the
       task list. If the number of tasks is odd, start at the two points
       that bracket the midpoint by half the interval.

     . Count out from the starting point (or points) and select the
       starting point and each task at the interval to be paired with
       Task 1.

     . To select pairs for succeeding tasks, add one to each task number
       paired with the preceding task.

Stop pairing tasks when the desired number of pairs is reached.


This method of forming the pairs may be illustrated by two examples. The
total number of tasks for the M60A1 Driver was 70. Since the total number
of tasks is even, seven pairings of each are desired. The total number of
pairs of tasks that will appear on the questionnaire is
(70 x 7) / 2 = 245.

An interval is obtained by dividing the number of tasks by the desired
number of pairings for each task: 70 / 7 = 10. One then begins at the
approximate midpoint of the 70 tasks, using the interval to count up and
down from the midpoint to obtain seven task numbers. The seven task
numbers thus obtained are 35 (approximate midpoint), 25 (ten less than
35), 15 (another ten less), 5 (another ten less), 45 (ten more), 55, and
65.

The tasks corresponding to these numbers are paired with Task 1. Task 2 is
paired with the seven tasks corresponding to each of the seven task
numbers plus one: Task 2 is paired with Task 6, then with 16, with 26, and
so forth. Task 3 is paired with each of the seven numbers for Task 2 plus
one: 3 with 7, 3 with 17, 3 with 27, and so forth. The progression is
followed until the desired number of pairs (245 in this case) is reached.

If the total number of tasks is odd and six pairings of each are desired,
a procedure is followed that is identical in most respects to the one
described above. The difference is that after obtaining the interval, one
begins counting up and down, not from the approximate midpoint, but from
two points approximately equidistant by half the interval from the
midpoint. For example, the total number of tasks for the M60A3 Loader was
65. The number of pairs of tasks that will appear on the questionnaire is
(65 x 6) / 2 = 195. The interval is 65/6, or 11 after rounding, and the
midpoint is 33. Adding and subtracting approximately half the interval to
and from the midpoint yields starting points at Tasks 27 and 38 (or 28 and
39). Counting up and down by 11 yields four additional tasks (numbers 5,
16, 49, and 60). These and Tasks 27 and 38 get paired with Task 1. Task 2
is paired with Tasks 6, 17, 28, 39, 50, and 61; and so forth until the
desired number of pairs (195) is reached.

The methods described above are applicable in all cases where the total
number of tasks is greater than 28. At some numbers of tasks less than 28,
the effects of rounding the interval present problems. With a total of 20
tasks, for example, Task 1 would get paired with itself. And with a total
of 10 tasks, the interval is one, which would lead to a complete rather
than a partial pairing of tasks. These problems are unimportant, since
with a small number of tasks, the use of complete pairings would become
feasible and the need for using partial pairings would disappear.
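
The pairing procedure can be expressed as a short Python sketch, which reproduces the two worked examples. Several details are assumptions rather than statements from the text: the interval is rounded to the nearest whole number, the odd-case starting points are taken as 27 and 38 rather than 28 and 39, and task numbers that run past the end of the list wrap around to the beginning. All function names are hypothetical.

```python
def pairings_per_task(n_tasks):
    """Seven pairings for an even-length list, six for an odd-length list."""
    return 7 if n_tasks % 2 == 0 else 6

def total_pairs(n_tasks):
    """(Tasks on list x pairings per task) / 2."""
    return n_tasks * pairings_per_task(n_tasks) // 2

def partners_for_task_one(n_tasks):
    """Task numbers to be paired with Task 1."""
    k = pairings_per_task(n_tasks)
    interval = round(n_tasks / k)  # rounding rule is an assumption
    if n_tasks % 2 == 0:
        mid = n_tasks // 2
        return [mid + j * interval for j in range(-3, 4)]
    mid = (n_tasks + 1) // 2
    low = mid - (interval + 1) // 2  # lower starting point (assumption)
    high = low + interval            # upper starting point
    return [low - 2 * interval, low - interval, low,
            high, high + interval, high + 2 * interval]

def build_pairs(n_tasks):
    """Shift Task 1's partners by one for each succeeding task."""
    base = partners_for_task_one(n_tasks)
    target = total_pairs(n_tasks)
    pairs = []
    for task in range(1, n_tasks + 1):
        for b in base:
            # Wrapping past the end of the list is an assumption.
            partner = (b + task - 2) % n_tasks + 1
            pairs.append((task, partner))
            if len(pairs) == target:
                return pairs
    return pairs

driver_pairs = build_pairs(70)   # M60A1 Driver example
loader_pairs = build_pairs(65)   # M60A3 Loader example
```

With 20 tasks the same functions pair Task 1 with itself, which is the rounding problem noted for short task lists.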
INSTRUCTIONS TO RESPONDENTS FOR THE
PAIRED COMPARISON QUESTIONNAIRE

Materials

Please check to see that you have two sets of papers in addition to these
instructions. The two sets of papers are:

A.  A set of Answer Sheets,* and

B.  A set of papers entitled "Paired Comparisons."

If you do not have both sets of papers, please raise your hand and we'll
give you what you need.

Personal Data

Please look at the cover page of the Answer Sheets, entitled "Personal
Data." We'd like you to fill in your name, rank, and so forth. Please be
assured that your answers will be treated as anonymous. Our interest is
not in who gives what answers, and none of this information will be used
against you. Later on though, we may want to find out if people with
different kinds and amounts of experience answered the questions
differently. We also may want to contact you for some follow-up questions.
To do these things we will need the Personal Data.

Please fill in all the blanks on the cover page of the Answer Sheets. If
anything is not clear, please ask questions.

Purpose of the Exercise

The purpose of this exercise is to find out what sorts of priorities you
place on crew members' ability to perform various tasks. To do this, we
would like you to make several assumptions:


*Last-minute changes required not using answer sheets, and that the
questionnaires be taken home by respondents rather than administered in a
conference room as originally intended. Respondents were told, therefore,
to circle their responses on the questionnaire, and to ignore parts of the
instructions that implied group administration.
. Assume that you are a company commander.

. Assume further that you must choose crew members to take on a mission.

. Assume also that you and your crews are certain to encounter the enemy
  during the mission, and will exchange fire with him.

To get you to choose crew members, we will present several pairs of tasks.
The crew member whom you choose can do only one of the two tasks in each
pair. Each of you will be dealing with only one crew position and only one
tank. Here's an example of a pair of tasks like the ones we'll ask you
about:

A.  Inspect an M219 machinegun.

B.  Stow main gun rounds in tank.

(The example is for an M60A1 Loader, which may not correspond to the tank
and crew position that you'll be dealing with. But the instructions that
follow apply regardless of the tank and crew position that you'll be
working with.)

If you choose A in the example, you will get a Loader who can inspect an
M219 machinegun, but cannot stow main gun rounds in an M60A1. If you
choose B in the example, you will get a Loader who can stow main gun
rounds in the M60A1, but cannot inspect an M219 machinegun. (We realize
that this is not a realistic assumption, but please accept it for purposes
of the study.)

Any questions up to this point? If so, raise them now, and let's try to
get them answered. If not, please proceed with the following five practice
problems. All of the practice problems apply to the M60A1 Loader. The
problems that you will do later may apply to a different tank and a
different crew position.
Practice Problems


P1   A.  Mount an M219 machinegun in tank.

     B.  Perform operator maintenance on radios and accessories.

If you would rather have the Loader who can mount an M219 machinegun,
darken A in the P1 row of the Practice block of the Answer Sheet. If you
would rather have the Loader who can perform operator maintenance on
radios and accessories, darken B in the P1 row. Please make your marks
dark and heavy. The answer sheets will be machine scored.

P2   A.  Clean an M219 machinegun.

     B.  Boresight IR sight of Gunner's periscope during daylight.

Would you rather have a Loader who could do A, or a Loader who could do B?
Remember, you can't have both, so you must choose one. If A, darken A
after P2 on the Answer Sheet. If B, darken B. Any questions up to this
point? If so, please raise them. If not, please complete practice problems
P3, P4, and P5:

P3   A.  Install main gun breechblock.

     B.  Service tank main gun ammunition.


P4   A.  Unload misfired main gun round.

     B.  Disassemble the breechblock.


P5   A.  Operate vehicular intercommunications equipment.

     B.  Place gun tube in travel lock.


If you've completed all five practice problems and have no questions,
please read the section that follows, and then proceed with the remaining
items. Take your time, and if there's any part of the exercise you don't
understand, please ask us about it.

Note on Gunnery Items

Several of the comparisons that you will make will involve gunnery items,
which require a word of explanation. Here's a pair of gunnery tasks for
the M60A1:

A.  Gunner fires main gun battlesight engagement using the GPD
    (stationary/moving).

B.  Tank Commander fires nonprecision .50 caliber engagement using the
    TPI (stationary/moving).

The fire control instruments in this example and in all the other gunnery
items will be abbreviated. The abbreviations and their definitions are:

AUX = Auxiliary Fire Controls

GPD = Gunner's Periscope Day

GPI = Gunner's Periscope Infrared

INF = Infinity Sight

RFD = Rangefinder Day

RFI = Rangefinder Infrared

TEL = Telescope

TPD = Tank Commander's Periscope Day

TPI = Tank Commander's Periscope Infrared

The two words in parentheses after each item refer to the movement of the
firing vehicle and the target, in that order. Thus, moving/stationary
means moving firing vehicle/stationary target. And stationary/moving means
stationary firing vehicle/moving target.

Finally, all gunnery items begin with either the word Gunner or Tank
Commander. This does not necessarily mean that you are choosing a Gunner
or a Tank Commander. Suppose, for example, that the notation at the top
of your paired comparison sheet is for Loader, M60A1. And you have a
gunnery item such as:

A. Gunner fires main gun battlesight engagement
using the GPD (stationary/moving).

B. Tank Commander fires nonprecision .50 caliber
engagement using the TPI (stationary/moving).

If your job is to choose a Loader, you must ask yourself, "Would I rather
have a Loader who could perform the Loader's duties associated with A
above, or a Loader who could perform the Loader's duties associated with
B above?" The fact that the Gunner is firing one of the engagements
in the example, and the Tank Commander is firing the other engagement, is
largely irrelevant here, since we're choosing not a Gunner or a Tank
Commander, but a Loader.


PLAN  FOR  EXAMINING  CONSTRUCT  VALIDITY 
OF  THE  CRITICALITY  RATINGS 

The main requirement in any plan to validate skill criticality
ratings is to minimize dependence on expert judgment in defining the
criterion measures. If this is not done, then validation reduces to
establishing the correlation between two sets of expert opinions. High
correlations might indicate reliable ratings (that both sets of ratings
were made on the same or highly correlated concepts), but are not
adequate evidence that judges were considering the concept of criticality
in their ratings.

The  ideal  validation  plan  would  involve  actual  or  simulated  combat 
missions,  embarked  upon  under  identical  conditions  as  many  times  as 
there  are  identified  skills.  On  each  enactment,  one  skill  would  be 
missing.  Attainment  of  the  mission  objective  would  then  be  rated  as 
success  or  failure.  By  replicating  across  many  missions,  the  proportion 
of  failures  would  be  used  as  the  criticality  rating  for  the  skill 
designated  as  "missing"  for  those  mission  enactments. 

Such an approach would certainly provide information concerning
the degree to which deficiencies in skills degrade performance of a
mission, or criticality. But the disadvantages are obvious and
overwhelming: time and cost requirements; impossibility of standardizing
conditions; and difficulty in ensuring that tasks in all skill areas
are performed adequately, except for those in the "missing" skill, which
must not be performed. If the tasks and skills could be fully defined
in terms of initiators, standards of performance, and consequences of
performance or nonperformance, and if all interactions among consequences
of performance or nonperformance of all skills were known, and if all
consequences and interactions of consequences could be empirically related
to success or failure, then a mathematical model could be defined and
computer-simulated to overcome all the former difficulties. This would
be a major task, for which data concerning "successful" consequences
would have to be obtained as described above, at which point the same
disadvantages immediately would re-emerge. The need for actual or simulated
missions could be side-stepped by presenting the situations to a panel
of experts and obtaining their judgments of specific consequences of
inadequate performance on each skill, which could then be converted to,
perhaps, a five-point success/failure scale. This again reduces to a set
of expert opinions, which may reflect task difficulty or frequency of
performance as well as criticality.

From the foregoing it may be seen that there are two general
approaches to obtaining skill criticality ratings for purposes of
validation: the empirical study, to obtain "real" criticality, or the expert
questionnaire study, to obtain estimates of criticality. The first is
costly, time-consuming, and practically (as opposed to theoretically)
impossible. The second produces results which, though possibly reliable,
may be confounded among criticality, difficulty, complexity, or frequency
of performance. Any combination of the two approaches, while it may serve
to eliminate some of the problems inherent in one, will necessarily be
subject to problems associated with the other.

A method is available, however, whereby the expert ratings of
criticality, obtained through the paired-comparison technique, may be examined
for possible influences or contamination from factors other than criticality.
The correlational study of validity, developed by Campbell and Fiske (1959),
encompasses measures of several factors, each measured by two or more
methods. Measures of the same factor by dissimilar methods should converge,
while measures of different factors by the same or different methods should
diverge.

The most frequently encountered challenges to the validity of
criticality ratings are that the ratings represent learning difficulty (DF)
or performance deficiency (PD) as perceived by raters. The validation
study will examine skill ratings as derived from task ratings on these
variables and on criticality (CR) by two methods. The results of the
analysis will provide information concerning the independence of the
criticality variable from other variables that might influence criticality
ratings.

METHOD 

Raters 


The  measures  of  criticality  and  other  variables  will  be  obtained 
from  volunteers  from  the  Armor  Officers'  Advanced  Course  at  Fort 
Knox.  Each  person  will  respond  to  items  by  the  two  methods  for  criticality, 
difficulty  to  learn,  and  performance  deficiency. 

Procedure 1: Paired Comparisons

The first method will require raters to make judgments of the
criticality (CR), learning difficulty (DF), and performance deficiency (PD)
of pairs of tasks. Twenty tasks will be paired according to the
partial-pairing algorithm of McCormick and Bachus (1952), yielding a total of 60
pairs to be judged three times in each of the twelve sets. On the basis
of the raters' judgments, scale values for CR, DF, and PD will be assigned
to each of the tasks judged. These values will then be averaged for tasks
within the skill clusters defined by the cluster analysis, across tanks,
to yield CR, DF, and PD scale values for each skill within the four duty
positions, for each rater.
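
The pairing and scaling steps of Procedure 1 can be sketched in a few lines. This is only an illustration under stated assumptions: the cyclic design below is a stand-in that happens to give 60 pairs for 20 tasks, not the McCormick and Bachus algorithm itself, and the scale value used here (the share of a task's comparisons in which it was chosen) is one simple choice among several. The function names and the example rater are hypothetical.

```python
from collections import defaultdict


def cyclic_partial_pairs(n_tasks, offsets=(1, 2, 3)):
    """Pair task i with tasks i+1, i+2, i+3 (mod n_tasks).

    Assumption: an illustrative cyclic partial design, not the
    published McCormick and Bachus partial-pairing algorithm.
    """
    pairs = []
    for i in range(n_tasks):
        for k in offsets:
            pairs.append((i, (i + k) % n_tasks))
    return pairs


def proportion_scale_values(pairs, choices, n_tasks):
    """Scale value = share of a task's comparisons in which it was chosen."""
    wins = defaultdict(int)
    seen = defaultdict(int)
    for (a, b), chosen in zip(pairs, choices):
        seen[a] += 1
        seen[b] += 1
        wins[chosen] += 1
    return [wins[t] / seen[t] for t in range(n_tasks)]


pairs = cyclic_partial_pairs(20)
# Hypothetical rater who always prefers the lower-numbered task.
choices = [min(a, b) for a, b in pairs]
values = proportion_scale_values(pairs, choices, 20)
```

With 20 tasks and three offsets, each task appears in six of the 60 pairs, so every task is judged equally often even though only a fraction of the 190 possible pairs is presented.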

Tasks


Each of the twelve sets of tasks will consist of a sample of
all tasks from each duty position (Driver, Loader, Gunner, Tank Commander)
by each tank (M60A1, M48A5, M60A3). The tasks were assigned criticality
ratings in the paired comparison study described in this report. A total
of 20 tasks from the criticality study will be used in the validation.

The 20 tasks will be the seven most critical, the seven least critical,
and the six closest to the median criticality rating.
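
The selection rule for the 20 tasks can be expressed directly. The sketch below assumes the criticality ratings are available as a mapping from task name to scale value; the function name, the task labels, and the example ratings are hypothetical, and ties are broken arbitrarily by sort order.

```python
from statistics import median


def select_validation_tasks(ratings, n_high=7, n_low=7, n_mid=6):
    """Pick the most critical, least critical, and near-median tasks.

    ratings: dict mapping task name -> criticality scale value.
    Returns three lists: (most_critical, near_median, least_critical).
    """
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    most = ranked[:n_high]
    least = ranked[-n_low:]
    # Near-median tasks are drawn only from tasks not already selected.
    remaining = ranked[n_high:len(ranked) - n_low]
    mid_value = median(ratings.values())
    near = sorted(remaining, key=lambda t: abs(ratings[t] - mid_value))[:n_mid]
    return most, near, least


# Hypothetical ratings for 40 tasks labelled T1..T40.
example_ratings = {f"T{i}": float(i) for i in range(1, 41)}
most, near, least = select_validation_tasks(example_ratings)
```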

Instructions 

To obtain the CR ratings, the same instructions will be given to the
raters as were given in the criticality study.

In obtaining ratings of DF, the instructions to the raters will differ
only in that the raters are told to assume that they must decide which of
the two crew members, each of whom is deficient on one task, will require
the greater amount of practice to bring him up to proficiency
on that task, so that he would be able to perform the task adequately in
a live fire engagement.

For ratings of PD, the instructions will ask the raters to judge
on which of a pair of tasks incumbents are more likely to be deficient.

By this method, each of three factors (CR, DF, and PD) has
an implicit operational definition, as follows:

CR (criticality) - the extent to which deficiency on
the task would degrade mission success.

DF (learning difficulty) - the amount of practice
needed to ensure proficiency on a task.

PD (performance deficiency) - likelihood that incumbents
are deficient on the task.

Each of the raters will make judgments for all three dimensions, on
only one of the 12 sets of tasks (four duty positions within each of three
tanks). At least five raters must rate each of the sets.

Procedure 2: Rating of Behavioral Descriptors

Each task considered in this study already has been characterized in
terms of a set of task descriptors. These descriptors will be rated by
the raters in terms of CR, DF, and PD. The ratings will then be summed for
each task, according to whether or not the descriptor is involved in
performance of the task, and then averaged for tasks within the skill clusters
to yield scale values for CR, DF, and PD within each duty position for
each rater.
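
The sum-then-average step of Procedure 2 can be sketched as follows for a single rater and a single variable (for example CR). All data here are hypothetical: the descriptor numbers, ratings, task labels, and cluster assignments are invented for illustration, and the function names are not from the report.

```python
def task_score(descriptor_ratings, descriptors_in_task):
    """Sum the rater's ratings over the descriptors involved in the task."""
    return sum(descriptor_ratings[d] for d in descriptors_in_task)


def skill_scores(descriptor_ratings, task_descriptors, clusters):
    """Average task scores within each skill cluster."""
    scores = {}
    for skill, tasks in clusters.items():
        totals = [task_score(descriptor_ratings, task_descriptors[t])
                  for t in tasks]
        scores[skill] = sum(totals) / len(totals)
    return scores


# Hypothetical ratings (1-50 scale) of four descriptors by one rater.
cr_ratings = {1: 10, 12: 20, 25: 40, 30: 30}
# Hypothetical descriptor profiles for three tasks.
task_descriptors = {"A": {1, 25}, "B": {12, 25, 30}, "C": {30}}
# Hypothetical skill clusters.
clusters = {"skill_1": ["A", "B"], "skill_2": ["C"]}

cr_by_skill = skill_scores(cr_ratings, task_descriptors, clusters)
```

Running the same function on the rater's DF and PD descriptor ratings would give the other two sets of skill values.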

Behavioral  Descriptors 

The  behavioral  descriptors  to  be  used  in  the  ratings  are  those  that 
were  used  to  define  the  tasks  for  the  cluster  analyses.*  They  are  listed 
and  defined  in  Appendix  A. 

Instructions 

The raters will be given the list of behavioral descriptors and a
list of the definitions of the descriptors. They will be instructed
to rate the 32 descriptors on a scale from 1 to 50, on CR, DF, and PD, where
1 = least critical/difficult/deficient, and 50 = extremely critical/difficult/
deficient. The three factors will be defined for the raters as:

CR - the extent to which deficient performance on the
descriptor would degrade performance of the soldier's
tasks.

DF  - the  amount  of  practice  required  by  the  soldier  to 
attain  proficiency  on  the  behavior. 

PD  - the  likelihood  that  incumbents  will  be  deficient  in 
performance  of  the  behavior. 


*Only 32 of the descriptors will be used. The descriptors numbered 8
(Smell), 17 (None), 24 (Identifies Symbols) and 36 (None) will be deleted
because they were not used to characterize any task in the original study.

The instructions will be similar to those shown in Appendix I.

Each rater will consider the descriptors relative to only one of the four
duty positions, the same duty position which he considered in making the
paired comparison ratings. Thus the descriptors will be considered by
at least 15 raters for each duty position.

ANALYSIS 

The  first  step  in  the  analysis  will  be  to  compute  a rank  order 
correlation  between  the  CR  values  obtained  from  the  paired  comparisons 
in  the  Criticality  Study  and  in  the  Validation  Study.  All  skills  will 
be  ranked  from  1 to  N (the  number  of  skills  for  the  duty  position)  on 
the  two  sets  of  CR  values;  the  rank  order  correlation  should  be  at  least 
.60  to  ensure  that  the  same  construct  of  criticality  is  being  validated. 
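
A rank order correlation of this kind can be computed as below. The sketch assumes Spearman's coefficient with tied values given their average rank; the report does not name the specific coefficient, and the example scale values are hypothetical.

```python
def average_ranks(values):
    """Rank from 1 (largest value) to N, giving ties their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical CR scale values for five skills in the two studies.
criticality_study = [0.9, 0.7, 0.5, 0.3, 0.1]
validation_study = [0.8, 0.9, 0.4, 0.2, 0.3]
rho = spearman(criticality_study, validation_study)
meets_threshold = rho >= 0.60
```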

For  each  of  the  four  sets  of  skills  (one  for  each  duty  position) , 
the  scale  values  of  CR,  DF,  and  PD  from  each  rater  by  the  two  methods  will 
be  correlated.  The  correlations  will  be  entered  in  a correlation  matrix, 
as  illustrated  in  Table  H-1. 

The hypothesis is that the correlations will be fairly substantial
in the sections of the matrix for each variable by the two methods
(superscribed a, b, and c in Table H-1), and that the remaining correlations,
which presumably pair distinctive variables, will be low. The measures
of CR and PD converge very well in the example, having correlations of .91
and .89, respectively. The two measures of DF correlate somewhat lower
(.75), but still higher than ratings of different variables by the same
methods (superscribed d and e). The correlations between DF and CR by
either method are only slightly higher than within-method correlations
between DF and PD but considerably higher than the within-method correlations
between CR and PD. This suggests that DF is more difficult for raters
to assess than CR or PD, and somewhat more easily confused with CR than
is PD. Still, each of the three variables emerges as distinct, with little
overlap between variables within methods, and high convergence within
variables across methods.


TABLE F-1


MULTIFACTOR-MULTIMETHOD MATRIX OF HYPOTHETICAL
CORRELATIONS OF CRITICALITY, LEARNING DIFFICULTY,
AND PERFORMANCE DEFICIENCY SCALE VALUES OBTAINED
BY PAIRED COMPARISONS AND RATINGS
OF BEHAVIORAL DESCRIPTORS

The data obtained in the administration of the two instruments for
each of the three variables will be entered into multivariable-multimethod
matrices for each set of skills. The matrices will then be examined for
convergence and divergence as described and illustrated in the example.
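
One way to organize such a matrix for inspection is sketched below. It is a simplified reading of the Campbell and Fiske approach under stated assumptions: each measure is a list of skill scale values keyed by (variable, method), the method labels "paired" and "descriptor" are invented, the data are hypothetical, and the single pass/fail check (every same-variable correlation exceeds every different-variable correlation) is stricter and cruder than a full examination of the matrix.

```python
from itertools import combinations


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mtmm_blocks(measures):
    """Sort all pairwise correlations into the three kinds of matrix cell."""
    blocks = {"convergent": {}, "same_method": {}, "different_method": {}}
    for k1, k2 in combinations(sorted(measures), 2):
        (var1, meth1), (var2, meth2) = k1, k2
        if var1 == var2:
            kind = "convergent"          # same variable, different methods
        elif meth1 == meth2:
            kind = "same_method"         # different variables, same method
        else:
            kind = "different_method"    # different variables and methods
        blocks[kind][(k1, k2)] = pearson(measures[k1], measures[k2])
    return blocks


def convergent_exceed_others(blocks):
    """True if every convergent value is above every other value."""
    others = list(blocks["same_method"].values())
    others += list(blocks["different_method"].values())
    return min(blocks["convergent"].values()) > max(others)


# Hypothetical scale values for eight skills.
measures = {
    ("CR", "paired"): [1, 1, 1, 1, -1, -1, -1, -1],
    ("DF", "paired"): [1, 1, -1, -1, 1, 1, -1, -1],
    ("PD", "paired"): [1, -1, 1, -1, 1, -1, 1, -1],
    ("CR", "descriptor"): [1.5, 0.5, 1.5, 0.5, -0.5, -1.5, -0.5, -1.5],
    ("DF", "descriptor"): [1, 1, -1, -1, 1, 1, -1, -1],
    ("PD", "descriptor"): [1, -1, 1, -1, 1, -1, 1, -1],
}
blocks = mtmm_blocks(measures)
distinct = convergent_exceed_others(blocks)
```

In this invented example the descriptor-based CR measure is partly contaminated by PD, so the CR convergence falls below 1 while still exceeding every cross-variable correlation.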

The validity of the criticality ratings can, of course, be challenged
on the grounds of confounding by sources other than learning difficulty
and performance deficiency. The effects of the other sources can be
isolated using a design identical to the one described here.


DEFINITIONS OF TASK DESCRIPTORS


STIMULI

1. Written (textual) material: (Books, job instructions, signs,
technical manuals.)

2. Graphic/tabular material: (Materials which deal with quantities
or amounts and are displayed in graphic or tabular form.)

3. Instrument read-outs: (Tools, equipment, machinery which are
sources of information when observed during use or operation,
for example, dials, gauges, signal lights, radarscopes,
speedometers, timing light, mine detector, multimeter.)

4. Natural environmental features: (Landscapes, fields, geological
samples, vegetation, cloud formations, and other features of
nature which are observed or inspected to provide information.)

5. Man-made environmental features: (Man-made or altered aspects
of the indoor or outdoor environment which are observed or
inspected to provide job information; do not consider equipment
or machines that a soldier uses in his work. For example,
structures, buildings, dams, highways, bridges, docks, railroads.)

6. Oral command or request: (Verbal orders, instructions, requests,
conversations, interviews, discussions, formal meetings. Consider
only verbal communication that is relevant to performance.)

7. Non-verbal sounds: (Noises, engine sounds, sonar, signals, horns.)

8. Smell (olfaction): (Odors which the soldier needs to smell in
order to initiate performance; do not include odors simply
because they happen to exist in the work environment.)

9. Body feel (kinesthesia): (Sensing or recognizing changes in the
direction or speed at which the body is moving without being able
to sense them by sight or hearing.)

10. Touch: (Pressure, pain, temperature, moisture; provides
information stimulus for performing the task.)

11. Self-initiated: (If a task can be performed without performing
a sub-task, no matter the consequences of not performing the
sub-task, then that sub-task is self-initiated. For example,
the loader can LOAD TANK MAIN GUN without "checking replenisher
tape," "inspecting the chamber for obstruction," or "standing
clear of path of recoil." These sub-tasks are then self-initiated.)

TOOLS, INSTRUMENTS, AND CONTROLS

12. Common hand tools and measuring devices: (Tools used to perform
operations not requiring great accuracy or precision; for example,
hammers, wrenches, trowels, knives, scissors, chisels, putty
knives, strainers, hand grease guns. Measuring devices include
rules, measuring tapes, micrometers, calipers, protractors,
squares, thickness gauges, levels, volume measuring devices,
tire gauges. Tools and measuring devices which are not unique
to a tank environment.)

13. Special hand tools and measuring devices: (Tools and measuring
devices which are unique to a tank environment. For example, the
extracting and ramming device.)

14. Activation controls: (Hand- or foot-operated devices used to
start, stop, or otherwise activate energy-using systems or
mechanisms. For example, light switches, electric motor switches,
ignition switches, power turret traverse.)

15. Fixed setting controls: (Hand- or foot-operated devices with
distinct positions, detents, or definite settings. For example,
gearshift, machinegun safety switch, ammunition control handle.)

16. Variable setting controls: (Hand- or foot-operated devices that
can be set at the beginning of operation, or infrequently, at
any position along a scale. For example, TV volume control,
room thermostat, rheostat, rangefinder range knob.)

17. None: (Tools, instruments, or controls are not used when
performing the task or sub-task.)

MEDIATING PROCESSES

18. Recalls bodies of knowledge: (Concerns verbal or symbolic
learning: acquisition and long-term maintenance of knowledge
so that it can be recalled. For example, recalling equipment
nomenclature or functions, recalling system functions,
recalling specific radio frequencies and other discrete facts.)

19. Uses verbal information: (Concerns the practical application of
information, limited uncertainty of outcome, little thought of
other alternatives. For example, based on academic knowledge,
determine which equipment to use for a specific task; compare
alternative modes of operation of a piece of equipment and
determine the appropriate mode for a specific situation. Based
on memorized knowledge of radio frequencies, choose the correct
frequency in a specific situation.)

20. Uses rules: (Choosing a course of action based on applying
known rules; frequently involves "if ... then" situations. The
rules are not questioned; the decision focuses on whether the
correct rule is being applied. For example, apply the "rules
of the road," solve mathematical equations, select proper fire
extinguisher for different type fires.)

21. Makes decisions: (Choosing a course of action when alternatives
are unspecified or unknown; a successful course of action is not
readily apparent. The penalties for unsuccessful courses of
action are not readily apparent. Frequently involves forced
decisions made in a short period of time with soft information.
For example, threat evaluation and weapon assignment; choosing
a diagnostic strategy in dealing with a malfunction in a complex
piece of equipment.)

22. Detects (including vigilance): (Vigilance: detect a few cues
embedded in a large block of time. Low threshold cues; early
awareness of small cues. For example, early detection of a
target; detect, through a slight change in sound, a bearing
starting to burn out in a power generator.)

23. Classifies: (Pattern recognition approach of identification,
not problem solving. Classification by non-verbal characteristics.
Object to be classified can be viewed from many perspectives
or in many forms. For example, classify a target as "friendly"
or "enemy"; determine that an identified noise is a wheel
bearing failure, not a water pump failure, by rating the quality
of the noise, not by the problem solving approach.)

24. Identifies symbols: (Involves the recognition of symbols which
typically are of low meaningfulness to untrained persons.
Identification, not interpretation, is emphasized. Involves
storing queries of symbolic information and related meanings.
For example, reading electronic symbols on a schematic drawing;
identifying map symbols; reading and transcribing symbols on a
tactical status board.)

25. Recalls set procedures: (Concerns the chaining or sequencing of
events; includes both the cognitive and motor aspects of equipment
set-up and operating procedures. Need to follow specific set
procedures or routines in order to obtain satisfactory outcomes.
For example, recalling equipment assembly and disassembly
procedures; recalling the operation and check out procedures for
a piece of equipment; following equipment turn-on procedures,
with emphasis on motor behavior.)


26. Estimates speed: (Concerns the speed of moving objects or
materials relative to a fixed point or to other moving objects.
For example, the speed of vehicles.)

27. Estimates distances: (Concerns the distance from one location to
another. For example, from observer's location to an object on
the horizon.)

28. Adopts proper attitude: (Concerns exhibiting a pattern of
behavior consistent with an attitude or value; a willingness to
perform according to a standard as opposed to skill to perform
according to that standard. Integrating or organizing a value
or attitude into a pattern of behavior. For example, complying
with known safety standards while performing a maintenance
procedure on a high voltage power supply.)

OVERT RESPONSES

29. Finger manipulation: (Concerns making finger movements in
various types of activities; usually the hand and arm are not
involved to any great extent. For example, indexing announced
ammunition into computer.)

30. Hand-arm movement: (Concerns the manual control or manipulation
of objects through hand or arm movements, which may or may not
require continuous visual control; requires coordination of
hand-arm movements. For example, pull charging handle of
M85 machinegun rearward until bolt locks in place; open breech.)

31. Foot-leg movement: (Concerns the manual control or manipulation
of objects through foot or leg movements, which may or may
not require continuous visual control; requires coordination
of foot-leg movements. For example, lock parking brakes on a


32. Steers: (Concerns compensatory movements based on feedback from
displays; involves estimating changes in positions, velocities,
accelerations and a knowledge of display-control relationships.
For example, tank driver following a road.)

33. Tracks: (A perceptual-motor activity involving continuous pursuit
of a target or keeping dials at a certain reading; requires
smooth muscle coordination patterns and lack of overcontrol.
For example, tank-gunnery target tracking; sonar operator
keeping the cursor on a sonar target.)

34. Reports in writing: (Concerns the copying or posting of
information for immediate or later use. For example, transcribing
a radio message; noting maintenance faults on DA Form 2404.)

35. Reports by talking: (Concerns the oral passage of routine or
nonroutine information or facts. For example, announce UP,
announce IDENTIFIED.)

36. None: (Task or sub-task has no overt response.)

APPENDIX H

EIGHTEEN TASK SAMPLE USED IN THE
PRACTICE RATINGS


1. Perform before-operations maintenance checks
on hydraulic brake system (Driver).

2. Perform before-operations maintenance checks
and services on tank engine and transmission
oil levels (Driver).

3. Install the M24 (IR) periscope (Driver).

4. Start tank engine (Driver).

5. Perform during-operations maintenance checks
and services on steering, accelerator, shift
and brake controls (Driver).

6. Remove the main gun breechblock group (Loader).

7. Disassemble the breechblock (Loader).

8. Perform main gun prepare-to-fire procedures
from the Loader's position (Loader).

9. Clear an M219 machinegun (Loader).

10. Load an M219 machinegun (Loader).

11. Prepare tank for boresighting (Loader).

12. Prepare tank for boresighting (Gunner).

13. Boresight Gunner's Telescope (Gunner).

14. Zero an M219 machinegun (Gunner).

15. Boresight rangefinder with the main gun bore
axis alined on an aiming point at 1200 meters
(Tank Commander).

16. Mount an M85 machinegun in a tank (Tank Commander).

17. Clear an M85 machinegun (Tank Commander).

18. Prepare tank for boresighting (Tank Commander).

APPENDIX I

TWENTY-TWO TASK SAMPLE USED TO VERIFY
INTER-RATER RELIABILITY


1. Perform before-operations maintenance checks on
fire extinguishers (Driver).

2. Stop tank engine (Driver).

3. Start tank engine by auxiliary power, slave
start (Driver).

4. Connect track (Driver).

5. Perform after-operations maintenance checks and
services on the gun travel lock (Driver).

6. Perform after-operations maintenance checks and
services on the tank batteries (Driver).

7. Adjust variable breech operating cam (Loader).

8. Perform emergency closing of main gun breech (Loader).

9. Remove an M219 machinegun from a tank (Loader).

10. Drain replenisher system (Gunner).

11. Operate Gunner's quadrant (Gunner).

12. Apply immediate action in case of main gun failure to
fire (Gunner).

13. Acquire ground targets (night) (Tank Commander).

14. Apply immediate action to reduce stoppage of an M85
machinegun (Tank Commander).

15. Gunner fires range card lay to direct fire using Gunner's
telescope and coax (stationary/moving).

16. Tank Commander fires nonprecision .50 caliber engagement
using the TPI (moving/moving).

17. Tank Commander fires nonprecision coax engagement using
the RFI (moving/moving).

18. Tank Commander fires main gun battlesight engagement using
the RFD (moving/stationary).

19. Gunner fires main gun battlesight to precision engagement
using the GPD (moving/stationary).

20. Gunner fires coax precision engagement using the TEL
(moving/stationary).

21. Tank Commander fires main gun range card lay to direct fire
using the RFD (stationary/stationary).

22. Gunner fires main gun precision engagement using the TEL
(stationary/moving).

INTER-RATER RELIABILITY STUDIES:

COMPUTATION DETAILS AND DISCUSSION OF RESULTS

COMPUTATION

A phi coefficient was computed for each subset of task descriptors
(Stimuli; Tools, Instruments and Controls; Mediating Processes; Overt
Responses) as well as the total (across subsets) for each of the 18
tasks both before and after rater discussion. The data for each task
were organized into two-by-two bivariate frequency tables for each
descriptor subset and for the total. Data were entered in 180 tables
(four subsets and total, by 18 tasks, both before and after rater
discussion) as follows:


              R2 = 0    R2 = 1

    R1 = 0      a         b

    R1 = 1      c         d

    R1 = Rater 1
    R2 = Rater 2


where a = number of cells corresponding to task descriptors in a subset
that both raters agreed were not included in subtasks of the
task.

b = number of cells corresponding to task descriptors in a subset
that Rater 1 said "is not" and Rater 2 said "is" included in
subtasks of the task.

c = number of cells corresponding to task descriptors in a subset
that Rater 1 said "is" and Rater 2 said "is not" included in
subtasks of the task.

d = number of cells corresponding to task descriptors in a subset
that both raters agreed were included in subtasks of the task.
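
For reference, the phi coefficient for a two-by-two table with cells a, b, c, and d as defined above is (ad - bc) divided by the square root of the product of the four marginal totals. The sketch below assumes that standard formula; the function name and the example counts are hypothetical and are not taken from the study's data.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table: a = both "is not", d = both "is",
    b and c = the two kinds of disagreement."""
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denominator == 0:
        # Undefined when a whole row or column of the table is empty.
        return float("nan")
    return (a * d - b * c) / denominator


# Hypothetical counts for one 33-cell table.
perfect_agreement = phi_coefficient(20, 0, 0, 13)
no_association = phi_coefficient(10, 10, 10, 10)
partial_agreement = phi_coefficient(22, 2, 3, 6)
```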

Figure J.1 is a sample rating sheet for preparing the two-by-two bivariate
frequency table for the Stimuli subset of one of the tasks in the sample.
Entries were made as follows:

[Two-by-two table with rows R1 = 0 and R1 = 1 and columns R2 = 0 and
R2 = 1; the cell entries are not legible in the source.]

PERFORM BEFORE-OPERATIONS MAINTENANCE CHECKS ON HYDRAULIC
BRAKE SYSTEM

RATER 1

1. Apply brake and hold for approximately 30 seconds.

2. Observe brake pressure gage and insure that it
indicates and maintains 750-900 PSI.

3. Note any drop in pressure as a fault on DA Form 2404.

RATER 2

1. Apply brake and hold for approximately 30 seconds.

2. Observe brake pressure gage and insure that it
indicates and maintains 750-900 PSI.

3. Note any drop in pressure as a fault on DA Form 2404.

Figure J.1. Sample rating sheet for preparing
two-by-two bivariate frequency table.
The sum of the entries in any table is equal to the product of the number
of task descriptors in the subset and the number of subtasks in the task.
(Eleven task descriptors by three subtasks = 33 entries.)

Since relatively few (typically about a third) of the 36 descriptors
were judged as characteristic of a given task, we were concerned that
inter-rater reliability coefficients would be inflated by the large
number of zero-zero agreements. This is a valid concern to the extent that
for a given task many descriptors are so totally and obviously irrelevant
that a "0" rating requires little intelligent judgment on the part of the
raters. To correct for this possibility, phi coefficients were computed
using selected descriptors in each case.

The coefficient was computed by first reducing the entries in cell "a"
of each bivariate frequency table by the product of the number of task
descriptors in any subset irrelevant to a particular task and the number
of subtasks in the task. For example, the two-by-two bivariate frequency
table for the Stimuli subset of the task in Figure J.1 was as follows:


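[Two-by-two frequency table for the Stimuli subset of the task in
Figure J.1: rows R1 = 0 and R1 = 1, columns R2 = 0 and R2 = 1, 33 entries
in all. The cell entries are not legible in the source.]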

Seven task descriptors (graphic/tabular material, natural environmental
features, man-made environmental features, oral command or request,
non-verbal sounds, smell, and body feel) were considered by both raters
irrelevant to the set of subtasks comprising this task; cell "a" was
therefore reduced by 21 (7 task descriptors by 3 subtasks). The selected
descriptors used to compute the phi coefficient for this subset were
written (textual) material, instrument read-outs, touch, and self-initiated.
No other cell entries were reduced by this procedure.


All coefficients of inter-rater reliability reported in the following
section were computed using the more conservative selected descriptors
approach, an approach yielding coefficients that averaged about .055
correlational points less than those based on all descriptors. Results
of the two computational approaches are compared in Appendix K.
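
The computation described in this section can be sketched in a few lines.
This is a minimal illustration only, assuming the standard phi formula for
a two-by-two table, phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)). The
function names are hypothetical, and the cell counts in the usage lines are
invented for illustration; they are not the entries for the sample task.

```python
import math


def phi(a, b, c, d):
    """Phi coefficient for a two-by-two table with cells a, b, c, d."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return float("nan")  # undefined when any marginal total is zero
    return (a * d - b * c) / denom


def phi_selected(a, b, c, d, n_irrelevant, n_subtasks):
    """Phi after removing zero-zero agreements on irrelevant descriptors.

    Only cell "a" is reduced, by n_irrelevant * n_subtasks.
    """
    return phi(a - n_irrelevant * n_subtasks, b, c, d)


# Hypothetical counts for an 11-descriptor, 3-subtask table (33 entries).
all_descriptors = phi(25, 1, 2, 5)
selected = phi_selected(25, 1, 2, 5, n_irrelevant=7, n_subtasks=3)
```

With these invented counts the selected-descriptors coefficient is the
smaller of the two, which is the direction of the correction described
above.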

RESULTS 

Effects  of  Rater  Discussion 

Inter-rater reliabilities for the 18 practice tasks are shown by
descriptor subset and rating period (before vs. after discussion) in
Table J.1. The coefficients in the body of the table show considerable
variation, and since many are based on fewer than 20 observations,
interpretations at the task-by-descriptor level probably are not useful.

At the total task level, however, the correlations are more stable. All
but two of the 36 rater agreement coefficients by task (right-hand column
of Table J.1) were significant at the .05 level. The before-discussion
reliabilities for Tasks 5 and 18, which were .20 and .12 respectively,
were not significant.¹

The effects of rater practice and discussion can be seen in the
bottom row of Table J.1. Total (across-descriptor) inter-rater reliability
increased after discussion, as did the reliabilities for each descriptor
category. The increase from .58 to .72 in total inter-rater reliability
was significant at the .05 level.² The increases in the reliabilities
for all but the Stimuli category of descriptors also were significant
at the .05 level.²
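
A test of this kind can be sketched as follows. This is an illustrative
reading only: it assumes the usual homogeneity statistic for Fisher
z-transformed correlations, V = sum of (n - 3)(z - zbar)^2, referred to
chi-square with k - 1 degrees of freedom, and it treats the coefficients as
if they came from independent samples, which the report does not state. The
function name is hypothetical; the inputs are the bottom-row totals of
Table J.1.

```python
import math


def fisher_z_homogeneity(rs, ns):
    """Chi-square-type statistic for equality of k correlations.

    rs: correlation coefficients; ns: number of observations behind each.
    """
    zs = [math.atanh(r) for r in rs]          # Fisher z transform
    weights = [n - 3 for n in ns]             # inverse sampling variances
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return sum(w * (z - z_bar) ** 2 for w, z in zip(weights, zs))


# Total reliabilities before and after discussion (Table J.1, bottom row).
v = fisher_z_homogeneity([0.576, 0.728], [1753, 1802])
CRITICAL_CHI2_DF1_05 = 3.841  # chi-square critical value, 1 df, .05 level
significant = v > CRITICAL_CHI2_DF1_05
```

For two coefficients the statistic equals the square of the familiar
z test for the difference between two correlations.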

Differences  in  reliability  as  a function  of  descriptor  category 
also  are  worth  noting.  Inter-rater  reliability  was  highest  for  the 
Overt  Response  category  both  before  and  after  discussion,  and  was  lowest 


¹[φ = .20] < [r with 28 df = .31]

 [φ = .12] < [r with 46 df = .24]

²The difference was evaluated statistically using a chi-square type
analysis of the transformed Fisher's z correlation (Hays, 1967, p. 532).
Table J.1

INTER-RATER RELIABILITIES (φ) FOR THE 18-TASK SAMPLE
BEFORE AND AFTER RATER DISCUSSION

                               TASK DESCRIPTOR SUBSETS

       RATING                 TOOLS, INSTMTS,  MEDIATING      OVERT
TASK   PERIOD   STIMULI (N)   CONTROLS (N)     PROCESSES (N)  RESPONSES (N)  TOTAL (N)

  1    BEFORE    .845 (12)     1.00 (3)         .293 (12)     1.00 (6)       .694 (33)
       AFTER     .550 (9)      .671 (11)       1.00 (3)       1.00 (9)       .778 (32)

  2    BEFORE    .633 (21)     .671 (21)       -.158 (21)      .867 (14)     .518 (77)
       AFTER     .848 (14)     .919 (28)       -.221 (28)     1.00 (14)      .606 (84)

  3    BEFORE   1.00 (9)       .000 (9)          NR (0)        .892 (18)     .835 (36)
       AFTER     .000 (9)      .478 (9)          NR (0)        .894 (18)     .717 (36)

  4    BEFORE    .501 (56)     .576 (42)        .129 (70)      .791 (42)     .562 (210)
       AFTER     .504 (56)     .696 (42)        .128 (56)      .930 (28)     .643 (182)

  5    BEFORE    .000 (4)      .577 (4)        -.255 (12)      .500 (10)     .200 (30)
       AFTER    1.00 (4)       .577 (4)         .447 (6)       .816 (10)     .707 (24)

  6    BEFORE    .752 (38)     .623 (57)        .716 (57)      .854 (38)     .745 (190)
       AFTER     .881 (38)     .936 (76)        .255 (76)      .948 (38)     .841 (228)

  7    BEFORE     NR (0)      1.00 (6)           NR (0)        .674 (12)     .886 (18)
       AFTER      NR (0)      1.00 (6)          .000 (12)      .357 (12)     .591 (30)

  8    BEFORE    .747 (72)     .511 (72)        .190 (72)      .527 (54)     .552 (270)
       AFTER     .715 (90)     .851 (90)        .753 (72)      .841 (54)     .805 (306)

  9    BEFORE    .804 (36)    1.00 (12)         .469 (34)      .500 (36)     .688 (118)
       AFTER     .217 (24)     .582 (36)        .692 (24)      .942 (36)     .706 (120)

 10    BEFORE    .645 (50)    1.00 (10)        -.050 (30)     1.00 (20)      .831 (110)
       AFTER     .608 (20)     .614 (30)        .464 (30)      .302 (20)     .563 (100)

 11    BEFORE    .000 (12)     .756 (9)         .632 (0)       .632 (6)      .644 (27)
       AFTER    1.00 (6)      1.00 (6)         1.00 (3)        .000 (6)     1.00 (21)

 12    BEFORE    .258 (28)    -.250 (21)         NR (14)       .333 (28)     .189 (91)
       AFTER     .632 (28)    1.00 (28)         .000 (21)     1.00 (28)      .806 (105)

 13    BEFORE   -.121 (55)     .471 (44)        .000 (66)      .278 (55)     .159 (220)
       AFTER     .806 (55)     .533 (33)        .583 (55)      .913 (44)     .723 (187)

 14    BEFORE    .129 (39)     .619 (43)        .174 (26)      .741 (39)     .386 (147)
       AFTER     .471 (26)     .571 (39)        .186 (39)      .939 (52)     .566 (156)

 15    BEFORE   1.00 (0)       .621 (0)         .000 (8)       .617 (24)     .648 (32)
       AFTER     .659 (0)      .707 (16)       1.00 (8)        .872 (8)      .818 (32)

 16    BEFORE     NR (18)       NR (18)        1.00 (18)       .730 (18)     .778 (72)
       AFTER      NR (27)      .745 (27)       1.00 (18)       .000 (18)     .881 (90)

 17    BEFORE    .791 (3)      .614 (9)         .686 (6)       .342 (6)      .614 (24)
       AFTER     .250 (3)      .500 (9)         .000 (3)       .892 (6)      .626 (21)

 18    BEFORE    .000 (12)     .745 (8)        -.135 (12)     -.041 (16)     .124 (48)
       AFTER     .816 (12)     .837 (12)       1.00 (8)        .618 (16)     .778 (48)

 ALL   BEFORE    .578 (465)    .610 (388)       .221 (458)     .661 (442)    .576 (1753)
 TASKS AFTER     .634 (421)    .744 (502)       .438 (462)     .859 (417)    .728 (1802)

NR = NONE RATED

for Mediating Processes. The rank-order of reliabilities for the
descriptor categories was the same before and after discussion.

Verification  Study 

As noted earlier, 22 of the 208 M60A1 tasks that were not rated in
the practice session were rated using the same methods and raters as were
used for the 18 practice tasks. The ratings of the 22-task sample were
compared to the second-round ratings of the 18-task sample, as a means
of verifying the level of inter-rater reliability attained in the final
round of ratings for the 18 practice tasks, and as a check on the
independence of the final ratings of the 18 practice tasks.

Phi coefficients, computed as in the practice ratings, are presented in
Table J.2. Here it can be seen that the rank-order of the reliabilities
for the four descriptor categories is the same as the before-and-after
rank-orders in the practice ratings. Overt Responses and Mediating Processes
were highest and lowest, respectively.

Inter-rater reliabilities for the two samples are presented in
Table J.3, where it can be seen that the reliabilities were consistently
lower for the 22-task sample than for the 18-task sample. The
differences between the reliabilities for the two samples are significant
(.05 level) for each descriptor category except Mediating Processes, and
for the total across descriptors.

Combined reliabilities also are shown in Table J.3 (bottom row). The
combined coefficients are not the means for the two samples. Rather the
coefficients were obtained by treating the two samples as one 40-task
sample, and computing five separate phis: one for each of the four
descriptor categories, and one for the total across descriptors. The
overall reliability for the combined sample approached .70, with Overt
Responses and Mediating Processes once again ranking highest and lowest.
Table J.2

INTER-RATER RELIABILITIES (φ) FOR THE 22-TASK SAMPLE

                         TASK DESCRIPTOR SUBSETS

                     TOOLS, INSTMTS,  MEDIATING      OVERT
TASK   STIMULI (N)   CONTROLS (N)     PROCESSES (N)  RESPONSES (N)  TOTAL (N)

  1       --            --             .250 (6)       .800 (9)      .586 (27)
  2       --            --              NR           1.00 (18)      .596 (48)
  3       --            --             .185 (39)      .856 (26)     .675 (169)
  4       --            --            -.062 (30)      .790 (30)     .520 (100)
  5       --            --             .707 (6)       .707 (6)      .583 (20)
  6       --            --             .160 (33)      .866 (33)     .500 (121)
  7       --            --              NR            .333 (6)      .667 (12)
  8       --            --             .000 (4)      1.00 (8)       .704 (20)
  9       --            --              NR            .745 (14)     .710 (28)
 10       --            --             .000 (4)       .000 (4)      .624 (28)
 11       --            --            -.163 (60)      .519 (60)     .191 (180)
 12       --            --             .000 (12)      .507 (36)     .590 (108)
 13       --            --            -.038 (35)      .166 (10)     .129 (65)
 14       --            --             .194 (48)      .626 (32)     .553 (192)
 15       --            --             .684 (116)     .865 (145)    .845 (493)
 16       --            --             .432 (44)      .714 (66)     .589 (176)
 17       --            --             .390 (90)      .704 (108)    .604 (324)
 18       --            --             .827 (80)      .916 (80)     .762 (288)
 19       --            --             .718 (125)     .867 (125)    .758 (450)
 20       --            --             .642 (110)     .846 (88)     .737 (374)
 21       --            --             .571 (125)     .916 (125)    .751 (475)
 22       --            --             .708 (161)     .752 (138)    .682 (506)

 ALL
 TASKS  .550 (1062)   .671 (847)       .493 (1128)    .779 (1167)   .662 (4204)

NR = NONE RATED
-- = entry not legible in the source. Task identifiers also are not
legible; rows are numbered in the order in which they appear.

DISCUSSION 




The  data  from  the  practice  ratings  present  little  interpretive  difficulty. 
Increases  in  reliability  after  practice  and  discussion  were  observed 
across  descriptors,  and  in  each  of  the  four  descriptor  categories.  The 
increases were significant for inter-rater reliability across descriptors
and for three of the four descriptor categories. The benefit of
practice  and  discussion  on  inter-rater  reliability  seems  unequivocal. 


Interpreting the results of the Verification Study is less
straightforward. Inter-rater reliabilities for the 22-task sample were
significantly lower overall and in three of the four descriptor categories
than were inter-rater reliabilities for the second-round ratings of the
18-task sample. One might be inclined therefore to conclude that the
practice effect, while dramatic, is highly specific to the sample of tasks
being rated. The tenability of this conclusion may be examined by
comparing inter-rater reliabilities for the 22-task sample and for the
first-round ratings of the 18-task sample. If the practice effect were
specific to the sample of tasks being rated, then no differences would be
expected between inter-rater reliabilities for the ratings of the 22-task
sample and the first-round ratings of the 18-task sample. The two sets of
ratings are presented in Table J.4. Increases in reliability can be seen
across descriptors, and in three of the four descriptor categories. All
increases were significant. (The decrease in the Stimuli category was
not significant.) It appears then that the practice effect has both
specific and general components: inter-rater reliability increased
significantly when the 18-task sample was re-rated and when the 22-task
sample was rated for the first time. That inter-rater reliability was
significantly lower for the 22-task sample than for the second-round
ratings of the 18-task sample simply suggests that the practice effect
is stronger when identical tasks are rated and then re-rated, than when
the practice sample is different from the sample that is rated for record.
The important point is not that practice affected inter-rater reliability
differently for the two samples, but that significant increases in
inter-rater reliability occurred in both cases. The overall reliability
was about .70 in both cases, and was .68 for the combined sample. The
coefficients are far in excess of chance expectancy, and are estimates
of the inter-rater reliability for all tasks rated after the practice
session.

Inherent differences in the difficulty with which tasks may be
characterized by each descriptor subset were suggested by the stability
of the rank-orders of reliabilities for the descriptor categories in the
practice ratings and in the Verification Study. Inter-rater reliability
was invariably highest for Overt Responses, probably because descriptors
in this category required little definition beyond naming, and were
therefore easily judged as required or not required in task performance.
The subset for Tools, Instruments and Controls yielded somewhat lower
indexes of agreement; the raters disagreed mainly on the use of fixed
and variable controls, and on common and special hand tools. Ready
access to tanks, as a means of verifying information obtained from
technical manuals and experts, would have eliminated many of these
disagreements.

Inter-rater reliability for Stimuli was depressed because of fairly
consistent disagreement between raters in choosing either self-initiated
or oral command/request descriptors. Many of these disagreements
probably could have been eliminated by pinpointing their sources early in
the rating process, and increasing the precision of the descriptor
definitions.

Mediating Processes consistently yielded the lowest inter-rater
reliability. The descriptors in this category were not mutually
exclusive, not easily defined or remembered, and offered no external
criteria against which the raters could evaluate the validity of their
judgments. More precise descriptor definitions and additional rater
practice might have improved reliability here.
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CONCLUSIONS 


Among the conclusions that can be drawn from the inter-rater
reliability studies are:

1.  Inter-rater reliability increased significantly
    with practice and discussion,
    irrespective of whether the tasks rated
    for record were the same as or different
    from the tasks rated for practice.

2.  Overall inter-rater reliabilities for the
    tasks rated after practice were about .70.

3.  Inter-rater reliability varied consistently
    as a function of descriptor subsets.
    Reliability was invariably highest for Overt
    Responses and lowest for Mediating Processes.

4.  Increases in inter-rater reliability greater
    than those obtained in the present studies
    probably could have been achieved with:

    A.  Increased precision and clarity of the
        descriptor definitions.

    B.  More practice.

    C.  More access to operational equipment,
        as a means of verifying information
        obtained from technical manuals and
        experts.
APPENDIX K


PHI COEFFICIENTS BASED ON ALL
DESCRIPTORS COMPARED TO PHI
COEFFICIENTS BASED ON SELECTED
DESCRIPTORS


EIGHTEEN TASK SAMPLE
(COMBINED PHI FOR BEFORE AND AFTER RATINGS)

                               DESCRIPTOR SUBSETS

                               TOOLS, INST.  MEDIATING  OVERT
                      STIMULI  CONTROLS      PROCESSES  RESPONSES  TOTAL

ALL DESCRIPTORS         --       .772          .397       --        .717

SELECTED DESCRIPTORS    --       .691          .334      .776       .659


TWENTY-TWO TASK SAMPLE

                               DESCRIPTOR SUBSETS

                               TOOLS, INST.  MEDIATING  OVERT
                      STIMULI  CONTROLS      PROCESSES  RESPONSES  TOTAL

ALL DESCRIPTORS        .617      .720          .535      .815       .713

SELECTED DESCRIPTORS   .550      .671          .493      .779       .662

-- = entry not legible in the source.
CLUSTER ANALYSIS PROCEDURES

Each cluster analysis began by calculating the "behavioral distance"
between every pair of tasks. Many distance measures have been reported
in the literature, but for the one-zero data in the task by
task-descriptor matrix, most of the measures are equivalent. The Simple
Matching Coefficient (SMC) was used to measure behavioral distance in
the present analyses. The SMC measures distance by the proportion of
task descriptors that is identical between each pair of tasks. Thus
for two tasks that have exactly the same values on 12 of the 36 descriptors,
the intertask distance is 12/36 or .33.
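
The coefficient is simple enough to state as code. The sketch below is an
illustration under stated assumptions: each task is represented as a list
of 36 one-zero values, and the function name is hypothetical. Note that,
computed this way, the value is a proportion of matches, so larger values
mean the two tasks are more alike.

```python
def simple_matching(task_a, task_b):
    """Proportion of descriptors on which two one-zero profiles agree."""
    if len(task_a) != len(task_b):
        raise ValueError("profiles must cover the same descriptors")
    matches = sum(x == y for x, y in zip(task_a, task_b))
    return matches / len(task_a)


# Two hypothetical 36-descriptor profiles that agree on exactly 12 values.
task_1 = [1] * 36
task_2 = [1] * 12 + [0] * 24
smc = simple_matching(task_1, task_2)  # 12/36
```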

Two clustering algorithms which employ the SMC were considered.
One of these, the Average Distance Amalgamation algorithm,¹ has long
been used to form clusters with the kind of data available, but requires
an assumption that the 36 task descriptors are orthogonal. Since this
assumption seemed questionable, another algorithm which does not require
the orthogonality assumption, the Direct Clustering algorithm,² ³ was
used.

Use of the SMC produces a matrix that shows the behavioral distance
between every pair of tasks. Tasks that are "close together" in
behavioral distance form the task clusters or skills. The process is
amalgamative, in that the two closest tasks form the seed for the first
cluster. Nearby tasks are incorporated into this cluster until a task is
found that is too far away; this task then forms the seed of a new
cluster. Clusters amalgamate similarly. In the first pass of the analysis,
each task forms a cluster. Successive passes produce fewer and fewer
clusters, each containing more and more tasks, until on the final pass all
tasks are included in a single cluster. Selecting passes and clusters
within passes is driven by the purposes for doing so.
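
The pass-by-pass amalgamation can be illustrated with a generic
average-linkage sketch. This is not the Direct Clustering algorithm used in
the study, whose details are not given here; it is a stand-in that merges
the two most similar clusters on each pass, so that every pass has one
cluster fewer than the last. The task names, the four-descriptor profiles,
and the function names are all hypothetical.

```python
from itertools import combinations


def smc(a, b):
    """Proportion of descriptors on which two one-zero profiles agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def average_similarity(tasks, c1, c2):
    """Mean SMC over all task pairs drawn from two clusters."""
    pairs = [(p, q) for p in c1 for q in c2]
    return sum(smc(tasks[p], tasks[q]) for p, q in pairs) / len(pairs)


def amalgamate(tasks):
    """Return the clusters present on each pass, from singletons to one."""
    clusters = [(name,) for name in sorted(tasks)]
    passes = [list(clusters)]
    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_similarity(
                tasks, clusters[ij[0]], clusters[ij[1]]
            ),
        )
        merged = tuple(sorted(clusters[i] + clusters[j]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        passes.append(list(clusters))
    return passes


tasks = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 0, 1],
    "C": [0, 0, 1, 1],
    "D": [0, 0, 1, 0],
}
passes = amalgamate(tasks)
```

Choosing which pass to read the skills from is left to the analyst, as the
text notes; the sketch only produces the joining sequence.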

¹Dixon, W.J., op. cit., 1975.

²Hartigan, J.A., op. cit., 1972.

³Dixon, W.J., op. cit., 1975.
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SELECTING  PASSES  AND  CLUSTERS 

The task-joining sequences for each of the four duty positions are
presented in Figures L.1, L.2, L.3, and L.4. The clusters that formed
in each pass are indicated by brackets; the clusters that were selected
to represent skills are indicated by heavy lines. The tasks comprising
each skill are presented by duty position in Appendix B.

The procedure for selecting passes and clusters is constrained by
the requirement that the integrity of clusters be maintained. One
examines the clusters as they form larger clusters from pass to pass.
Since (by definition) any cluster contains tasks grouped according to
similar task descriptors, a criterion other than similar descriptors is
needed for selecting clusters. The criterion that was used was to
try to find the smallest number of clusters that were:

1.  Dissimilar  operationally  from  one  another. 

2.  Each  comprised  of  functionally  or  operationally 
related  tasks. 

After  examining  the  clusters,  it  became  apparent  that  the  criterion 
could  not  be  rigorously  applied  in  all  cases.  Some  compromises  were 
required. 

When  the  tasks  comprising  a cluster  described  similar  mission 
operations,  we  selected  that  cluster  and  gave  it  a title  in  terms  of 
its mission characteristics. When the tasks did not describe similar
mission  operations,  we  used  the  clusters  from  the  preceding  pass  unless 
they  numbered  more  than  four.  When  there  were  more  than  four  clusters 
in  the  preceding  pass,  the  non-similar  task  cluster  was  used  and  described 
in  mission-operation  terms  which  defined  most  of  the  tasks  in  the  cluster. 
These  clusters  are  indicated  in  Appendix  B by  an  asterisk.  Sometimes 
two  or  three  dissimilar  tasks  formed  a cluster  during  Pass  1 and  remained 
a unique  cluster  until  the  final  pass.  When  this  happened,  the  integrity 
of the cluster was maintained. An example is Cluster 9 for the Gunner,
"Assist in Night .50 Caliber Engagements," which is a three-task cluster.
Two of the tasks (A3306 and AL306) pertain to assisting in a .50 caliber
engagement, and the third task (AA310) is an azimuth indicator task.
They formed a cluster during Pass 1 and remained together in all
successive passes.

[Figure: Task joining sequence for Gunner tasks.]

[Fig. L.4. Task joining sequence for Tank Commander tasks.]

In two cases (Cluster 5 for the Gunner and Cluster 9 for the Tank
Commander) the clusters were divided into two clusters to make them
more homogeneous in terms of mission operations.


DESCRIBING  THE  SKILLS 

Skill descriptions were written after the clusters were selected
and named. For example, the skill description for Tank Commander's
Cluster 1, "Operate Weapon Systems," was:

    Performs fixed procedure, finger-hand-arm manipulation
    of various controls in voluntary response to man-made
    environmental features, non-verbal sounds, or touch,
    by recalling facts, detecting or classifying
    information.

The method for describing the skills was generally to mention overt
responses first; then the tools, instruments, and controls; next, the
stimuli associated with the responses; and finally, the mediating
process. The formula was: "Performs [OVERT RESPONSE(S)] of [TOOLS,
INSTRUMENTS, AND CONTROLS], in response to [STIMULI] by [MEDIATING
PROCESSES]." Application of the formula was by no means hard and fast.
Variations in the descriptions resulted from using the following
guidelines:

1.  Task descriptors that appeared in greater than
    50 percent of the tasks in a cluster were
    mentioned.

2.  Task descriptors that appeared in 30 to 50
    percent of the tasks in a cluster were mentioned,
    preceded by "sometimes."

3.  The task descriptor "recalls set procedures"
    was placed after "Performs" and changed to
    "fixed procedure."

4.  When all the controls occurred, the words
    "various controls" were used.

5.  The task descriptor "steers" was changed
    to "continuous manipulation"; "tracks"
    was changed to "compensatory manipulation,"
    and placed after "Performs."

6.  When "foot-leg movement" occurred with
    "finger manipulation," "hand-arm movement,"
    or both, "multi-limb manipulation" was used.

7.  When both "oral command or request" and
    "reports by talking" occurred, "communicates
    orally" was used and placed before "Performs."

8.  When "reports by talking," "reports in
    writing" or both occurred, each was placed
    after the mediating processes.

9.  The task descriptor "self-initiated" was
    changed to "voluntary response."
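
Guidelines 1 and 2 amount to a frequency rule, which can be sketched as
follows. This is an illustration only: it assumes each task in a cluster is
represented as a set of descriptor names, treats "30 to 50 percent" as
inclusive at both ends, and leaves out the wording substitutions in
guidelines 3 through 9. The function name and the example cluster are
hypothetical.

```python
def classify_descriptors(cluster):
    """Split a cluster's descriptors into 'mentioned' and 'sometimes'.

    cluster: list of sets, one set of descriptor names per task.
    """
    n_tasks = len(cluster)
    counts = {}
    for task in cluster:
        for descriptor in task:
            counts[descriptor] = counts.get(descriptor, 0) + 1

    mentioned, sometimes = [], []
    for descriptor in sorted(counts):
        count = counts[descriptor]
        if 2 * count > n_tasks:            # more than 50 percent of tasks
            mentioned.append(descriptor)
        elif 10 * count >= 3 * n_tasks:    # 30 to 50 percent of tasks
            sometimes.append("sometimes " + descriptor)
    return mentioned, sometimes


# Hypothetical 10-task cluster: "touch" in 6 tasks, "non-verbal sounds"
# in 3, "body feel" in 2.
cluster = (
    [{"touch", "non-verbal sounds", "body feel"}] * 2
    + [{"touch", "non-verbal sounds"}]
    + [{"touch"}] * 3
    + [set()] * 4
)
mentioned, sometimes = classify_descriptors(cluster)
```

Integer comparisons are used for the thresholds so that a descriptor at
exactly 30 or 50 percent is classified without floating-point rounding.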


LEARNING  AND  EVALUATION  DIFFICULTY  STUDY 


This part of Task 1 was aimed at obtaining estimates of the
relative difficulty of learning and evaluating the skills identified
in the cluster analysis. The estimates were derived from the
judgments of members of the project staff, who rated the task descriptors
in terms of the relative training difficulty and the relative
evaluation difficulty for the domain of tank crew behavior associated with
each descriptor. Difficulty estimates for each skill were made by
assigning the descriptor ratings to the modal descriptor pattern for
each skill.

Descriptors rather than skills were rated for several reasons.
The main reason was that rating the descriptors provides a set of
stable scores, which in turn provide flexibility that might be needed
later in the project. If, for example, learning- or evaluation-
difficulty scores at the task level are desired, they are easily
obtained: one simply examines the descriptor pattern for the task
on the one hand, and the descriptor scores on the other. A task rating
is derived by combining the scores appropriate to the descriptor
pattern of the task. Similarly, if task clusters are combined or
further divided later, it will not be necessary to conduct new studies
to obtain learning- and evaluation-difficulty scores for the new
clusters. The descriptor patterns for the new clusters can be examined
and new ratings derived by combining the descriptor scores that
correspond to the descriptor patterns.

Another reason for not rating the skills directly was that the
skills are global, and thus invite unreliability in ratings. If exemplar
tasks are given the rater for each skill, then the risk is that the
ratings will be made of the exemplar tasks only, and not of the skill
as a whole. If raters are given the population of tasks for each skill,
unreliability is once again invited: some raters will focus on one
part of the population, and others on other parts. If raters are
given only the skill title and description with no reference to
tasks, the problem remains. Raters will invent their own exemplar
tasks, which may differ from rater to rater. The consequence is
degraded inter-rater reliability, because raters are rating "different
things."

Use of a partial paired comparison study, similar or identical
in all essentials to the criticality study described earlier, also
was considered and abandoned. One reason was that at least two
such studies would be required: one for learning difficulty and
another for evaluation difficulty. Tabulating and analyzing
paired-comparison studies would have placed demands on project resources
that could not have been met.

RATERS 

Five members of the project staff, two of whom had performed
the original ratings of the tasks in terms of the 36 descriptors,
and all of whom were familiar with the project purposes and proposed
methodology, performed the difficulty ratings.

PROCEDURE 

A list of the 36 descriptors with four descriptors deleted
was given to each rater, along with the descriptor definitions that
appear in Appendix G. The four deleted descriptors were ones that
were used by neither of the two raters in the original task
characterization: "smell" in the Stimuli subset; "none" in the Tools,
Instruments, and Controls subset; "identifies symbols" in the Mediating
Process subset; and "none" in the Overt Responses subset.
The raters were asked to assign three numbers from an absolute
scale of one (extremely easy to learn or evaluate) to 50 (extremely
difficult to learn or evaluate) to the domain of tank crew behavior
associated with each descriptor. The three ratings of each
descriptor were to represent:

1.  Learning  difficulty. 

2.  "Hands-on"  performance  evaluation  difficulty 
(where  test  validity  is  not  a problem) . 

3.  Difficulty  of  evaluation  by  any  means,  while 
maintaining  acceptable  validity,  and  trading 
off  validity  against  economy. 

Additional  details  of  the  instructions  to  the  raters  may  be  found  in 
Appendix  N. 

After the raters had considered the descriptors in terms of the
three factors, they discussed their interpretations of the descriptors,
and were permitted to adjust their ratings of difficulty. Only the
second set of evaluation difficulty ratings, representing difficulty
of any means of testing, including full-performance testing, was
used to determine skill evaluation difficulty; the full-performance
evaluation difficulty ratings were requested so that the raters would
first assign ceiling values to each descriptor's difficulty. The
ratings of difficulty of evaluating by any means would then be the same
as or lower than those of full-performance testing, depending on the
feasibility of other means and the sacrifice in validity.

RESULTS

Difficulty  Scales 

The values assigned to the 32 descriptors on learning and evaluation
difficulty were averaged across raters, and the mean values were
used in computing the skill difficulties. For the modal pattern of
descriptors for each skill, the difficulty values of those descriptors
were summed separately for learning and evaluation difficulty. The
skill learning difficulty sums ranged from 87 to 456, and the
evaluation difficulty sums ranged from 58 to 287. Although these values
reflect not only the separate difficulty values assigned to individual
descriptors, but also the number of descriptors comprising each skill,
it was felt that skill difficulty, as an additive function of the
difficulty of the descriptors, would be reflected better by the sum than
by the mean. The sums were converted to standardized scales for
learning and evaluation difficulty, each with a mean of 5.00 and standard
deviation of 1.00, the same standard scale as was used for criticality
ratings. The standardized scale values for each skill were presented
in Tables 4 through 7.
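
The scaling procedure described above can be sketched as follows. This is a minimal illustration, not the computation actually used in the study: the descriptor names, skill names, and ratings are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the report does not say which was used.

```python
from statistics import mean, pstdev

def descriptor_means(ratings_by_rater):
    """Average each descriptor's difficulty rating across raters."""
    descriptors = ratings_by_rater[0].keys()
    return {d: mean(r[d] for r in ratings_by_rater) for d in descriptors}

def skill_sums(skill_patterns, desc_means):
    """Sum the mean descriptor difficulties in each skill's modal pattern."""
    return {skill: sum(desc_means[d] for d in descs)
            for skill, descs in skill_patterns.items()}

def standardize(sums, target_mean=5.0, target_sd=1.0):
    """Rescale the sums to the target mean and standard deviation.

    Assumption: population SD (pstdev) is used for the rescaling.
    """
    values = list(sums.values())
    m, sd = mean(values), pstdev(values)
    return {k: target_mean + target_sd * (v - m) / sd for k, v in sums.items()}

# Hypothetical ratings (scale of 1 to 50) from three raters on four descriptors.
ratings = [
    {"A": 10, "B": 20, "C": 30, "D": 40},
    {"A": 12, "B": 22, "C": 28, "D": 38},
    {"A": 14, "B": 18, "C": 32, "D": 42},
]
# Hypothetical modal descriptor patterns for three skills.
patterns = {"S1": ["A", "B"], "S2": ["B", "C", "D"], "S3": ["A", "D"]}

means = descriptor_means(ratings)
sums = skill_sums(patterns, means)
scaled = standardize(sums)
```

Because the sum is used, a skill with more descriptors in its modal pattern (S2 in this sketch) receives a higher difficulty value than a skill with fewer descriptors of comparable individual difficulty.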

Reliability 

Inter-rater reliability was estimated by an analysis of variance
of the rater by descriptor data matrix.^ Intraclass correlations
were .76 for learning difficulty and .88 for evaluation difficulty,
indicating fairly high reliability of the average of the five sets
of ratings. (Each coefficient indicates the hypothetical correlation
that would be obtained between the average ratings for this set of five
raters and those from another random sample of five raters.) If it
is assumed, however, that the raters differed systematically in their
frames of reference for judging the descriptors, then the reported
correlations are underestimates of inter-rater reliability. When the
data are corrected for differences among rater means, the reliabilities
of the mean ratings are .85 for learning difficulty and .89 for
evaluation difficulty.
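
The two kinds of coefficient can be illustrated with a short sketch. The report does not give its formulas, so this assumes the common ANOVA-based estimates of the reliability of a mean of k ratings: (MS descriptors - MS within descriptors) / MS descriptors for the uncorrected value, and the same ratio with the residual mean square in place of the within-descriptor mean square when differences among rater means are removed. The function name and the data are hypothetical.

```python
def mean_rating_reliability(matrix):
    """Reliability of mean ratings from a descriptor-by-rater matrix.

    matrix[i][j] is the rating given to descriptor i by rater j.
    Returns (uncorrected, corrected); the corrected value removes
    differences among rater means from the error term.
    """
    n = len(matrix)       # number of descriptors (rows)
    k = len(matrix[0])    # number of raters (columns)
    grand = sum(sum(row) for row in matrix) / (n * k)
    row_means = [sum(row) / k for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n)) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in matrix for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_within = ss_total - ss_rows    # rater differences plus residual
    ss_resid = ss_within - ss_cols    # residual only

    ms_rows = ss_rows / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    ms_resid = ss_resid / ((n - 1) * (k - 1))

    uncorrected = (ms_rows - ms_within) / ms_rows
    corrected = (ms_rows - ms_resid) / ms_rows
    return uncorrected, corrected

# Hypothetical data: three descriptors rated by three raters who agree
# perfectly on ordering but differ by a constant offset (frame of reference).
example = [
    [1, 2, 3],
    [2, 3, 4],
    [4, 5, 6],
]
uncorrected, corrected = mean_rating_reliability(example)
```

In this example the raters differ only in their overall level, so the corrected coefficient is essentially 1.0 while the uncorrected coefficient is lower, which is the same direction of difference as in the values reported above.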


^Winer, B.J., op. cit., 1962.
INSTRUCTIONS TO RATERS FOR THE
LEARNING AND EVALUATION DIFFICULTY STUDIES

A list  of  32  behavioral  descriptors  is  attached,  along  with  a set 
of  definitions  of  the  descriptors. 

We need to get your judgments about the difficulty of learning, and
the difficulty of evaluating, behavior associated with each descriptor.

The difficulty judgments are to be made with respect to the entire
domain of tank crew behavior. Thus, if you're making a judgment about
the learning difficulty associated with the descriptor "Graphic/tabular
material," you should think in terms of the domain of tank crew behaviors
that involve using or responding to graphic or tabular materials. Then
the question to ask yourself is "How difficult would it be to learn the
behavior in this domain, relative to learning the behaviors in the domains
associated with the other descriptors?"

Learning difficulty is defined as the amount of time, practice, or
trials to criterion that would be required to attain proficiency in the
domain of behavior associated with each descriptor.

Evaluation difficulty is less straightforward. Here we'd like two
separate sets of ratings. The first set is concerned exclusively with
"hands-on" performance evaluation, where test validity is assumed not to
be a problem. That is, if we had our choice among high-fidelity performance
tests, then we could assume that validity is acceptable. The
judgments about evaluation difficulty therefore would be made on the
basis of considerations other than validity. The judgments probably
reduce to considerations of economy: Given that the "hands-on" performance
tests will yield acceptable validity, which of the tank crew
behaviors are more or less expensive to test in the "hands-on,"
full-performance mode? Factors that come into play here are, as you know,
equipment costs and scarcity, requirements for scarce terrain, amounts of
time required for testing, difficulty of standardization, and numbers and
kinds of personnel required to develop and administer the tests. Ultimately,
then, your judgments here will reduce to "How difficult (expensive) would
it be to evaluate the behavior in a 'hands-on' mode?" Or, "How expensive
would it be to conduct a 'hands-on' performance test?"


In the second set of evaluation difficulty ratings we are not concerned
exclusively with the "hands-on" performance setting. Rather, we
would like your judgments as to how difficult it would be to evaluate the
behavior by any means, and still maintain what in your view would be
acceptable test validity. If in your view an inexpensive paper-and-pencil
test could be used to measure with acceptable validity the behavior
associated with one of the 32 descriptors, then the descriptor would get
a lower evaluation difficulty rating than would a descriptor that would
require a more expensive full-performance or simulator-based test. Here
you are being asked to trade off economy and validity in evaluating the
behavior associated with each descriptor.


To  summarize:  you're  being  asked  for  three  sets  of  ratings: 

(1)  Learning  difficulty. 

(2) "Hands-on" performance evaluation difficulty (where
validity is not a problem).

(3) Difficulty of evaluation by any means, while maintaining
acceptable validity, and trading off validity
against economy.

Please assign three numbers to each descriptor: one for learning
difficulty, the other two for the two kinds of evaluation difficulty
discussed above. The numbers must be between one and 50, where 1 = extremely
easy to learn, or extremely easy to evaluate, and 50 = extremely difficult
to learn or evaluate. Don't try to do all three sets of judgments at the
same time. Do them individually.